Re: crush changes via cli

2013-03-22 Thread Sage Weil
On Fri, 22 Mar 2013, Gregory Farnum wrote:
> I suspect users are going to easily get in trouble without a more
> rigid separation between multi-linked and single-linked buckets. It's
> probably best if anybody who's gone to the trouble of setting up a DAG
> can't wipe it out without being very explicit — so for instance "move"
> should only work against a bucket with a single parent.

Good idea; I'll add that.

> Rather than
> defaulting to all ancestors, removals should (for multiply-linked
> buckets) require users to either specify a set of ancestors or to pass
> in a "--all" flag.

'rm' only works on an empty bucket, so I'm not sure there is much danger 
in removing all links (and the bucket) in that case?

> Also, I suspect that "rm" actually deletes the bucket while "unlink"
> simply removes it from all parents (but leaves it in the tree); that
> distinction might need to be a little stronger (or is possibly not
> appropriate to leave in the CLI?).

That's right.  The "remove" versus "unlink" verbs make that pretty clear 
to me, at least...  Are you suggesting this be clarified in the docs, or 
that the command set change?  I think once we settle on the CLI, John can 
make a pass through the crush docs and make sure these commands are 
explained.

> You mention that one of the commands "does nothing" under some
> circumstances — does that mean there's no error? If a command can't be
> logically completed it should complain to the user, not just fail
> silently.

It returns -ENOTEMPTY; sorry, poor choice of words.  :)

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: crush changes via cli

2013-03-22 Thread Gregory Farnum
On Fri, Mar 22, 2013 at 3:38 PM, Sage Weil  wrote:
> There's a branch pending that lets you do the remainder of the most common
> crush map changes via the CLI.  The command set breaks down like so:
>
> Updating leaves (devices):
>
>   ceph osd crush set <id> <weight> <loc> [<loc> ...]
>   ceph osd crush add <id> <weight> <loc> [<loc> ...]
>   ceph osd crush create-or-move <id> <weight> <loc> [<loc> ...]
>
> These let you create, add, and move devices in the map.  The difference
> between add and set is that add will create an additional instance of the
> osd (leaf), while set will move the old instance.  This is useful for some
> configurations.
>
> The loc ... bits let you specify the 'where' part in the form of key/value
> pairs, like 'host=foo rack=bar root=default'.  It will find the
> most-specific pair that matches an existing item, and create any
> intervening ancestors.  For example, if my map has only a root=default
> node (nothing else) and I do
>
>  ceph osd crush set osd.0 1 host=foo rack=myrack row=first root=default
>
> it will create the row, rack, and host nodes, and then stick osd.0 inside
> host=foo.
>
> Create-or-move is similar to set except that it won't ever change the
> weight of the device; it only sets the initial weight if it has to create it.
> This is used by the upstart hook so that it doesn't inadvertently clobber
> changes the admin has made.
>
> The next set of commands adjust the map structure. Although people usually
> create a tree structure, in reality the crush map is a DAG (directed
> acyclic graph).
>
>
>   ceph osd crush rm <name> [ancestor]
>
> Will remove an osd or internal node from the map, assuming there are no
> children.  With the optional ancestor arg, it will remove only instances
> under the given ancestor.  Otherwise, all instances are removed.  If it is
> a bucket and non-empty, it does nothing.
>
>   ceph osd crush unlink <name> [ancestor]
>
> Is similar, but will let you remove one (or all) links to a bucket even if
> it is non-empty.
>
>   ceph osd crush move <name> <loc> [<loc> ...]
>
> will unlink the bucket from its existing location(s) and link it in a new
> position.
>
>   ceph osd crush link <name> <loc> [<loc> ...]
>
> Doesn't touch existing links, only adds a new one.
>
> Finally,
>
>   ceph osd crush add-bucket <name> <type>
>
> is the one command that will create an internal node with no parent.
> Normally this is just used to create the root of the tree (e.g.,
> root=default).  Once it is there, devices can be added beneath it with
> set, add, link, etc., and the loc ... bits will add any intervening ancestors
> that are missing.
>
> This maps cleanly on to the internal data model that CRUSH is using.  As
> long as it doesn't bend everyone's mind in uncomfortable ways, I'd like to
> stick with it (or something like it)... but if there is something here
> that seems wrong, let me know!

I suspect users are going to easily get in trouble without a more
rigid separation between multi-linked and single-linked buckets. It's
probably best if anybody who's gone to the trouble of setting up a DAG
can't wipe it out without being very explicit — so for instance "move"
should only work against a bucket with a single parent. Rather than
defaulting to all ancestors, removals should (for multiply-linked
buckets) require users to either specify a set of ancestors or to pass
in a "--all" flag.
Also, I suspect that "rm" actually deletes the bucket while "unlink"
simply removes it from all parents (but leaves it in the tree); that
distinction might need to be a little stronger (or is possibly not
appropriate to leave in the CLI?).

You mention that one of the commands "does nothing" under some
circumstances — does that mean there's no error? If a command can't be
logically completed it should complain to the user, not just fail
silently.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


crush changes via cli

2013-03-22 Thread Sage Weil
There's a branch pending that lets you do the remainder of the most common 
crush map changes via the CLI.  The command set breaks down like so:

Updating leaves (devices):

  ceph osd crush set <id> <weight> <loc> [<loc> ...]
  ceph osd crush add <id> <weight> <loc> [<loc> ...]
  ceph osd crush create-or-move <id> <weight> <loc> [<loc> ...]

These let you create, add, and move devices in the map.  The difference 
between add and set is that add will create an additional instance of the 
osd (leaf), while set will move the old instance.  This is useful for some 
configurations.

The loc ... bits let you specify the 'where' part in the form of key/value 
pairs, like 'host=foo rack=bar root=default'.  It will find the 
most-specific pair that matches an existing item, and create any 
intervening ancestors.  For example, if my map has only a root=default 
node (nothing else) and I do

 ceph osd crush set osd.0 1 host=foo rack=myrack row=first root=default

it will create the row, rack, and host nodes, and then stick osd.0 inside 
host=foo.

Create-or-move is similar to set except that it won't ever change the 
weight of the device; it only sets the initial weight if it has to create it.  
This is used by the upstart hook so that it doesn't inadvertently clobber 
changes the admin has made.
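
For example (device names and weights below are made up, just to show the 
argument order described above):

  # create osd.1 at weight 1.0, or move its existing (single) instance here
  ceph osd crush set osd.1 1.0 host=foo rack=myrack root=default

  # add a second instance of osd.1 under another host, leaving the first alone
  ceph osd crush add osd.1 1.0 host=bar rack=myrack root=default

  # only sets an initial weight if it has to create the device; an existing
  # weight is never clobbered (this is what the upstart hook uses)
  ceph osd crush create-or-move osd.1 1.0 host=foo rack=myrack root=default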

The next set of commands adjust the map structure. Although people usually 
create a tree structure, in reality the crush map is a DAG (directed 
acyclic graph).


  ceph osd crush rm <name> [ancestor]

Will remove an osd or internal node from the map, assuming there are no 
children.  With the optional ancestor arg, it will remove only instances 
under the given ancestor.  Otherwise, all instances are removed.  If it is 
a bucket and non-empty, it does nothing.  

  ceph osd crush unlink <name> [ancestor]

Is similar, but will let you remove one (or all) links to a bucket even if 
it is non-empty.  
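
For example (bucket names here are hypothetical, following the semantics 
described above):

  # remove the empty bucket foo everywhere it is linked, and delete it
  ceph osd crush rm foo

  # remove only the instance of foo that lives under myrack
  ceph osd crush rm foo myrack

  # drop the link under myrack even though foo still has children;
  # foo itself stays in the map
  ceph osd crush unlink foo myrack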

  ceph osd crush move <name> <loc> [<loc> ...]

will unlink the bucket from its existing location(s) and link it in a new 
position.

  ceph osd crush link <name> <loc> [<loc> ...]

Doesn't touch existing links, only adds a new one.

Finally,

  ceph osd crush add-bucket <name> <type>

is the one command that will create an internal node with no parent.  
Normally this is just used to create the root of the tree (e.g., 
root=default).  Once it is there, devices can be added beneath it with 
set, add, link, etc., and the loc ... bits will add any intervening ancestors 
that are missing.
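
Putting it together, a from-scratch sketch might look like this (all names 
below are made up):

  # start with a parentless root bucket
  ceph osd crush add-bucket default root

  # placing a device creates the missing host/rack/row buckets along the way
  ceph osd crush set osd.0 1.0 host=foo rack=myrack row=first root=default

  # give host=foo a second parent (this is what makes the map a DAG)
  ceph osd crush link foo rack=otherrack root=default

  # relocate a bucket: unlink it from its current location(s), link it here
  ceph osd crush move foo rack=thirdrack root=default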

This maps cleanly on to the internal data model that CRUSH is using.  As 
long as it doesn't bend everyone's mind in uncomfortable ways, I'd like to 
stick with it (or something like it)... but if there is something here 
that seems wrong, let me know!

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Chris Mason
Quoting Chris Mason (2013-03-22 14:07:05)
> [ mmap corruptions with leveldb and btrfs compression ]
> 
> I ran this a number of times with compression off and wasn't able to
> trigger problems.  With compress=lzo, I see errors on every run.
> 
> Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
> Run: ./mmap-trunc file_name
> 
> The basic idea is to create a 256MB file in steps.  Each step ftruncates
> the file larger, and then mmaps a region for writing.  It dirties some
> unaligned bytes (a little more than 8K), and then munmaps.
> 
> Then a verify stage goes back through the file to make sure the data we
> wrote is really there.  I'm using a simple rotating pattern of chars
> that compress very well.

Going through the code here, when I change the test to truncate once in
the very beginning, I still get errors.  So, it isn't an interaction
between mmap and truncate.  It must be a problem between lzo and mmap.

-chris
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Latest 0.56.3 and qemu-1.4.0 and cloned VM-image producing massive fs-corruption, not crashing

2013-03-22 Thread Josh Durgin

On 03/22/2013 12:09 PM, Oliver Francke wrote:

Hi Josh, all,

I did not want to hijack the thread dealing with a crashing VM, but perhaps 
there are some common things.

Today I installed a fresh cluster with mkcephfs, went fine, imported a "master" debian 6.0 
image with "format 2", made a snapshot, protected it, and made some clones.
Clones mounted with qemu-nbd, fiddled a bit with 
IP/interfaces/hosts/net.rules…etc and cleanly unmounted, VM started, took 2 
secs and the VM was up and running. Cool.

Now an ordinary shutdown was performed, made a snapshot of this image. Started again, did 
some "apt-get update… install s/t…".
Shutdown -> rbd rollback -> startup again -> login -> install s/t else… filesystem showed 
"many" ex3-errors, fell into read-only mode, massive corruption.


This sounds like it might be a bug in rollback. Could you try cloning
and snapshotting again, but export the image before booting, and after
rolling back, and compare the md5sums?

Running the rollback with:

--debug-ms 1 --debug-rbd 20 --log-file rbd-rollback.log

might help too. Does your ceph.conf where you ran the rollback have
anything related to rbd_cache in it?
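
Concretely, something like this (pool/image/snapshot names are just 
placeholders):

  # reference copy before booting the clone
  rbd export volumes/clone1 clone1.before
  md5sum clone1.before

  # ... boot, install, shut down ...

  rbd snap rollback volumes/clone1@snap1 \
      --debug-ms 1 --debug-rbd 20 --log-file rbd-rollback.log

  # export again and compare; the sums should match if rollback is intact
  rbd export volumes/clone1 clone1.after
  md5sum clone1.after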


qemu config was with ":rbd_cache=false" if it matters. Above scenario is 
reproducible, and as I pointed out, no crash detected.

Perhaps it is in the same area as in the crash-thread, otherwise I will provide 
logfiles as needed.


It's unrelated, the other thread is an issue with the cache, which does
not cause corruption but triggers a crash.

Josh
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Latest 0.56.3 and qemu-1.4.0 and cloned VM-image producing massive fs-corruption, not crashing

2013-03-22 Thread Oliver Francke
Hi Josh, all,

I did not want to hijack the thread dealing with a crashing VM, but perhaps 
there are some common things.

Today I installed a fresh cluster with mkcephfs, went fine, imported a "master" 
debian 6.0 image with "format 2", made a snapshot, protected it, and made some 
clones.
Clones mounted with qemu-nbd, fiddled a bit with 
IP/interfaces/hosts/net.rules…etc and cleanly unmounted, VM started, took 2 
secs and the VM was up and running. Cool.

Now an ordinary shutdown was performed, made a snapshot of this image. Started 
again, did some "apt-get update… install s/t…".
Shutdown -> rbd rollback -> startup again -> login -> install s/t else… 
filesystem showed "many" ex3-errors, fell into read-only mode, massive 
corruption.

qemu config was with ":rbd_cache=false" if it matters. Above scenario is 
reproducible, and as I pointed out, no crash detected.

Perhaps it is in the same area as in the crash-thread, otherwise I will provide 
logfiles as needed.

Kind regards,

Oliver.

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: github pull requests

2013-03-22 Thread Florian Haas
On Fri, Mar 22, 2013 at 12:15 AM, Gregory Farnum  wrote:
> I'm not sure that we handle enough incoming yet that the extra process
> weight of something like Gerrit or Launchpad is necessary over Github.
> What are you looking for in that system which Github doesn't provide?
> -Greg

Automated regression tests and gated commits come to mind. Gerrit
alone of course doesn't help with that, you'd probably want to
consider either running Jenkins, or hook the master merges up with
automatic teuthology runs.

Just my two cents, though.

Cheers,
Florian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-22 Thread Travis Rhoden
That's awesome Josh.  Thanks for looking into it.  Good luck with the fix!

 - Travis

On Fri, Mar 22, 2013 at 1:11 PM, Josh Durgin  wrote:
> I think I found the root cause based on your logs:
>
> http://tracker.ceph.com/issues/4531
>
> Josh
>
>
> On 03/20/2013 02:47 PM, Travis Rhoden wrote:
>>
>> Didn't take long to re-create with the detailed debugging (ms =  20).
>> I'm sending Josh a link to the gzip'd log off-list, I'm not sure if
>> the log will contain any CephX keys or anything like that.
>>
>> On Wed, Mar 20, 2013 at 4:39 PM, Travis Rhoden  wrote:
>>>
>>> Thanks Josh.  I will respond when I have something useful!
>>>
>>> On Wed, Mar 20, 2013 at 4:32 PM, Josh Durgin 
>>> wrote:

 On 03/20/2013 01:19 PM, Josh Durgin wrote:
>
>
> On 03/20/2013 01:14 PM, Stefan Priebe wrote:
>>
>>
>> Hi,
>>
>>> In this case, they are format 2. And they are from cloned snapshots.
>>> Exactly like the following:
>>>
>>> # rbd ls -l -p volumes
>>> NAME SIZE
>>> PARENT   FMT PROT LOCK
>>> volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M
>>> images/b8bdda90-664b-4906-86d6-dd33735441f2@snap   2
>>>
>>> I'm doing an OpenStack boot-from-volume setup.
>>
>>
>>
>> OK i've never used cloned snapshots so maybe this is the reason.
>>
 strange i've never seen this. Which qemu version?
>>>
>>>
>>>
>>> # qemu-x86_64 -version
>>> qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008
>>> Fabrice Bellard
>>>
>>> that's coming from Ubuntu 12.04 apt repos.
>>
>>
>>
>> maybe you should try qemu 1.4 there are a LOT of bugfixes. qemu-kvm
>> does
>> not exist anymore it was merged into qemu with 1.3 or 1.4.
>
>
>
> This particular problem won't be solved by upgrading qemu. It's a ceph
> bug. Disabling caching would work around the issue.
>
> Travis, could you get a log from qemu of this happening with:
>
> debug ms = 20
> debug objectcacher = 20
> debug rbd = 20
> log file = /path/writeable/by/qemu



 If it doesn't reproduce with those settings, try changing debug ms to 1
 instead of 20.


>   From those we can tell whether the issue is on the client side at
> least,
> and hopefully what's causing it.
>
> Thanks!
> Josh



>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: docs

2013-03-22 Thread John Wilkins
My meeting got cancelled today, so I'll work with Gary to get this resolved.

On Fri, Mar 22, 2013 at 11:18 AM, Dan Mick  wrote:
>
>
> On 03/22/2013 05:37 AM, Jerker Nyberg wrote:
>>
>>
>> There seems to be a missing argument to ceph osd lost (also in help for
>> the command).
>>
>> http://ceph.com/docs/master/rados/operations/control/#osd-subsystem
>>
>
> Indeed, it seems to be missing the id.  The CLI is getting a big rework
> right now, but the docs should be corrected.  Patch or file an issue, either
> way.
>
>
>> src/tools/ceph.cc
>> src/test/cli/ceph/help.t
>> doc/rados/operations/control.rst
>>
>> The documentation for development release packages is slightly confused.
>> Should it not refer to http://ceph.com/rpm-testing for development
>> release packages? (Also, the ceph-release package in the development
>> release does not refer to itself (in /etc/yum.repos.d/ceph.repo) but to
>> (http://ceph.com/rpms) packages.)
>>
>> http://ceph.com/docs/master/install/rpm/
>>
>>
>> Do you want patches?
>>
>> --jerker
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
John Wilkins
Senior Technical Writer
Inktank
john.wilk...@inktank.com
(415) 425-9599
http://inktank.com
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: docs

2013-03-22 Thread Dan Mick



On 03/22/2013 05:37 AM, Jerker Nyberg wrote:


There seems to be a missing argument to ceph osd lost (also in help for
the command).

http://ceph.com/docs/master/rados/operations/control/#osd-subsystem



Indeed, it seems to be missing the id.  The CLI is getting a big rework 
right now, but the docs should be corrected.  Patch or file an issue, 
either way.



src/tools/ceph.cc
src/test/cli/ceph/help.t
doc/rados/operations/control.rst

The documentation for development release packages is slightly confused.
Should it not refer to http://ceph.com/rpm-testing for development
release packages? (Also, the ceph-release package in the development
release does not refer to itself (in /etc/yum.repos.d/ceph.repo) but to
(http://ceph.com/rpms) packages.)

http://ceph.com/docs/master/install/rpm/


Do you want patches?

--jerker
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Chris Mason
[ mmap corruptions with leveldb and btrfs compression ]

I ran this a number of times with compression off and wasn't able to
trigger problems.  With compress=lzo, I see errors on every run.

Compile: gcc -Wall -o mmap-trunc mmap-trunc.c
Run: ./mmap-trunc file_name

The basic idea is to create a 256MB file in steps.  Each step ftruncates
the file larger, and then mmaps a region for writing.  It dirties some
unaligned bytes (a little more than 8K), and then munmaps.

Then a verify stage goes back through the file to make sure the data we
wrote is really there.  I'm using a simple rotating pattern of chars
that compress very well.

I run it in batches of 100 with some memory pressure on the side:

for x in `seq 1 100` ; do (mmap-trunc f$x &) ; done

#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define FILE_SIZE ((loff_t)256 * 1024 * 1024)
/* make a painfully unaligned chunk size */
#define CHUNK_SIZE (8192 + 932)

#define mmap_align(x) (((x) + 4095) & ~4095)

char *file_name = NULL;

void mmap_one_chunk(int fd, loff_t *cur_size, unsigned char *file_buf)
{
	int ret;
	loff_t new_size = *cur_size + CHUNK_SIZE;
	loff_t pos = *cur_size;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;
	char val = file_buf[0];
	char *p;
	int extra;

	/* step one, truncate out a hole */
	ret = ftruncate(fd, new_size);
	if (ret) {
		perror("truncate");
		exit(1);
	}

	if (val == 0 || val == 'z')
		val = 'a';
	else
		val++;

	memset(file_buf, val, CHUNK_SIZE);

	extra = pos & 4095;
	p = mmap(0, map_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
		 pos - extra);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	memcpy(p + extra, file_buf, CHUNK_SIZE);

	ret = munmap(p, map_size);
	if (ret) {
		perror("munmap");
		exit(1);
	}
	*cur_size = new_size;
}

void check_chunks(int fd)
{
	char *p;
	loff_t checked = 0;
	char val = 'a';
	int i;
	int errors = 0;
	int ret;
	int extra;
	unsigned long map_size = mmap_align(CHUNK_SIZE) + 4096;

	fprintf(stderr, "checking chunks\n");
	while (checked < FILE_SIZE) {
		extra = checked & 4095;
		p = mmap(0, map_size, PROT_READ,
			 MAP_SHARED, fd, checked - extra);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		for (i = 0; i < CHUNK_SIZE; i++) {
			if (p[i + extra] != val) {
				fprintf(stderr, "%s: bad val %x wanted %x offset 0x%llx\n",
					file_name, p[i + extra], val,
					(unsigned long long)checked + i);
				errors++;
			}
		}
		if (val == 'z')
			val = 'a';
		else
			val++;
		ret = munmap(p, map_size);
		if (ret) {
			perror("munmap");
			exit(1);
		}
		checked += CHUNK_SIZE;
	}
	printf("%s found %d errors\n", file_name, errors);
	if (errors)
		exit(1);
}

int main(int ac, char **av)
{
	unsigned char *file_buf;
	loff_t pos = 0;
	int ret;
	int fd;

	if (ac < 2) {
		fprintf(stderr, "usage: mmap-trunc filename\n");
		exit(1);
	}

	ret = posix_memalign((void **)&file_buf, 4096, CHUNK_SIZE);
	if (ret) {
		perror("cannot allocate memory\n");
		exit(1);
	}

	file_buf[0] = 0;

	file_name = av[1];

	fprintf(stderr, "running test on %s\n", file_name);

	unlink(file_name);
	fd = open(file_name, O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	fprintf(stderr, "writing chunks\n");
	while (pos < FILE_SIZE) {
		mmap_one_chunk(fd, &pos, file_buf);
	}
	check_chunks(fd);
	return 0;
}
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Sage Weil
On Fri, 22 Mar 2013, Chris Mason wrote:
> Quoting Alexandre Oliva (2013-03-22 10:17:30)
> > On Mar 22, 2013, Chris Mason  wrote:
> > 
> > > Are you using compression in btrfs or just in leveldb?
> > 
> > btrfs lzo compression.
> 
> Perfect, I'll focus on that part of things.
> 
> > 
> > > I'd like to take snapshots out of the picture for a minute.
> > 
> > That's understandable, I guess, but I don't know that anyone has ever
> > got the problem without snapshots.  I mean, even when the master copy of
> > the database got corrupted, snapshots of the subvol containing it were
> > being taken every now and again, because that's the way ceph works.
> 
> Hopefully Sage can comment, but the basic idea is that if you snapshot a
> database file the db must participate.  If it doesn't, it really is the
> same effect as crashing the box.
> 
> Something is definitely broken if we're corrupting the source files
> (either with or without snapshots), but avoiding incomplete writes in
> the snapshot files requires synchronization with the db.

In this case, we quiesce write activity, call leveldb's sync(), take the 
snapshot, and then continue.

(FWIW, this isn't the first time we've heard about leveldb corruption, but 
in each case we've looked into, the user had btrfs compression 
enabled, so I suspect that's the right avenue of investigation!)
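
(A quick way to check whether a given OSD's btrfs mount has compression on 
is to look at its mount options, e.g.:

  grep btrfs /proc/mounts

and see whether compress=lzo or compress-force=lzo shows up in the option 
list.)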

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Chris Mason
In this case, I think Alexandre is scanning for zeros in the file.   The
incomplete writes will definitely show that.

-chris

Quoting Samuel Just (2013-03-22 13:06:41)
> Incomplete writes for leveldb should just result in lost updates, not
> corruption.  Also, we do stop writes before the snapshot is initiated
> so there should be no in-progress writes to leveldb other than leveldb
> compaction (though that might be something to investigate).
> -Sam
> 
> On Fri, Mar 22, 2013 at 7:26 AM, Chris Mason  wrote:
> > Quoting Alexandre Oliva (2013-03-22 10:17:30)
> >> On Mar 22, 2013, Chris Mason  wrote:
> >>
> >> > Are you using compression in btrfs or just in leveldb?
> >>
> >> btrfs lzo compression.
> >
> > Perfect, I'll focus on that part of things.
> >
> >>
> >> > I'd like to take snapshots out of the picture for a minute.
> >>
> >> That's understandable, I guess, but I don't know that anyone has ever
> >> got the problem without snapshots.  I mean, even when the master copy of
> >> the database got corrupted, snapshots of the subvol containing it were
> >> being taken every now and again, because that's the way ceph works.
> >
> > Hopefully Sage can comment, but the basic idea is that if you snapshot a
> > database file the db must participate.  If it doesn't, it really is the
> > same effect as crashing the box.
> >
> > Something is definitely broken if we're corrupting the source files
> > (either with or without snapshots), but avoiding incomplete writes in
> > the snapshot files requires synchronization with the db.
> >
> > -chris
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Latest bobtail branch still crashing KVM VMs in bh_write_commit()

2013-03-22 Thread Josh Durgin

I think I found the root cause based on your logs:

http://tracker.ceph.com/issues/4531

Josh

On 03/20/2013 02:47 PM, Travis Rhoden wrote:

Didn't take long to re-create with the detailed debugging (ms =  20).
I'm sending Josh a link to the gzip'd log off-list, I'm not sure if
the log will contain any CephX keys or anything like that.

On Wed, Mar 20, 2013 at 4:39 PM, Travis Rhoden  wrote:

Thanks Josh.  I will respond when I have something useful!

On Wed, Mar 20, 2013 at 4:32 PM, Josh Durgin  wrote:

On 03/20/2013 01:19 PM, Josh Durgin wrote:


On 03/20/2013 01:14 PM, Stefan Priebe wrote:


Hi,


In this case, they are format 2. And they are from cloned snapshots.
Exactly like the following:

# rbd ls -l -p volumes
NAME SIZE
PARENT   FMT PROT LOCK
volume-099a6d74-05bd-4f00-a12e-009d60629aa8 5120M
images/b8bdda90-664b-4906-86d6-dd33735441f2@snap   2

I'm doing an OpenStack boot-from-volume setup.



OK i've never used cloned snapshots so maybe this is the reason.


strange i've never seen this. Which qemu version?



# qemu-x86_64 -version
qemu-x86_64 version 1.0 (qemu-kvm-1.0), Copyright (c) 2003-2008
Fabrice Bellard

that's coming from Ubuntu 12.04 apt repos.



maybe you should try qemu 1.4 there are a LOT of bugfixes. qemu-kvm does
not exist anymore it was merged into qemu with 1.3 or 1.4.



This particular problem won't be solved by upgrading qemu. It's a ceph
bug. Disabling caching would work around the issue.

Travis, could you get a log from qemu of this happening with:

debug ms = 20
debug objectcacher = 20
debug rbd = 20
log file = /path/writeable/by/qemu
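
(These normally go in the [client] section of the ceph.conf that qemu reads 
on that host; a minimal sketch, with the log path left as a placeholder:

  [client]
      debug ms = 20
      debug objectcacher = 20
      debug rbd = 20
      log file = /path/writeable/by/qemu

The log file just needs to be writeable by the qemu process.)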



If it doesn't reproduce with those settings, try changing debug ms to 1
instead of 20.



  From those we can tell whether the issue is on the client side at least,
and hopefully what's causing it.

Thanks!
Josh




--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread David Sterba
On Fri, Mar 22, 2013 at 10:26:59AM -0400, Chris Mason wrote:
> Quoting Alexandre Oliva (2013-03-22 10:17:30)
> > On Mar 22, 2013, Chris Mason  wrote:
> > 
> > > Are you using compression in btrfs or just in leveldb?
> > 
> > btrfs lzo compression.
> 
> Perfect, I'll focus on that part of things.

> > > I'd like to take snapshots out of the picture for a minute.

I've reproduced this without compression, with autodefrag on. The test
was using snapshots (i.e. the unmodified version) and ended with

1087 blocks, 4316779 total size
snaptest.268/ca snaptest.268/db differ: char 4245170, line 16

after a few minutes.

Before that, I was running the NOSNAPS mode for many minutes (up to 50k
rounds) without a reported problem.

There was the same 'make clean && make -j 32' kernel compilation running
in parallel, the box has 8 cpus, 4GB ram. Watching 'free' showed the
memory going up to a few gigs and down to ~130MB.


david
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Samuel Just
Incomplete writes for leveldb should just result in lost updates, not
corruption.  Also, we do stop writes before the snapshot is initiated
so there should be no in-progress writes to leveldb other than leveldb
compaction (though that might be something to investigate).
-Sam

On Fri, Mar 22, 2013 at 7:26 AM, Chris Mason  wrote:
> Quoting Alexandre Oliva (2013-03-22 10:17:30)
>> On Mar 22, 2013, Chris Mason  wrote:
>>
>> > Are you using compression in btrfs or just in leveldb?
>>
>> btrfs lzo compression.
>
> Perfect, I'll focus on that part of things.
>
>>
>> > I'd like to take snapshots out of the picture for a minute.
>>
>> That's understandable, I guess, but I don't know that anyone has ever
>> got the problem without snapshots.  I mean, even when the master copy of
>> the database got corrupted, snapshots of the subvol containing it were
>> being taken every now and again, because that's the way ceph works.
>
> Hopefully Sage can comment, but the basic idea is that if you snapshot a
> database file the db must participate.  If it doesn't, it really is the
> same effect as crashing the box.
>
> Something is definitely broken if we're corrupting the source files
> (either with or without snapshots), but avoiding incomplete writes in
> the snapshot files requires synchronization with the db.
>
> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Chris Mason
Quoting Alexandre Oliva (2013-03-22 10:17:30)
> On Mar 22, 2013, Chris Mason  wrote:
> 
> > Are you using compression in btrfs or just in leveldb?
> 
> btrfs lzo compression.

Perfect, I'll focus on that part of things.

> 
> > I'd like to take snapshots out of the picture for a minute.
> 
> That's understandable, I guess, but I don't know that anyone has ever
> got the problem without snapshots.  I mean, even when the master copy of
> the database got corrupted, snapshots of the subvol containing it were
> being taken every now and again, because that's the way ceph works.

Hopefully Sage can comment, but the basic idea is that if you snapshot a
database file the db must participate.  If it doesn't, it really is the
same effect as crashing the box.

Something is definitely broken if we're corrupting the source files
(either with or without snapshots), but avoiding incomplete writes in
the snapshot files requires synchronization with the db.

-chris
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Alexandre Oliva
On Mar 22, 2013, Chris Mason  wrote:

> Are you using compression in btrfs or just in leveldb?

btrfs lzo compression.

> I'd like to take snapshots out of the picture for a minute.

That's understandable, I guess, but I don't know that anyone has ever
got the problem without snapshots.  I mean, even when the master copy of
the database got corrupted, snapshots of the subvol containing it were
being taken every now and again, because that's the way ceph works.
Even back when I noticed corruption of firefox _CACHE_* files, snapshots
taken for archival were involved.  So, unless the program happens to
trigger the problem with the -DNOSNAPS option about as easily as it did
without it, I guess we may not have a choice but to keep snapshots in
the picture.

> We need some way to synchronize the leveldb with snapshotting

I purposefully refrained from doing that, because AFAICT ceph doesn't do
that.  Once I failed to trigger the problem with Sync calls, and
determined ceph only syncs the leveldb logs before taking its snapshots,
I went without syncing and finally succeeded in triggering the bug in
snapshots, by simulating very similar snapshotting and mmaping
conditions to those generated by ceph.  I haven't managed to trigger the
corruption of the master subvol yet with the test program, but I already
knew its corruption didn't occur as often as that of the snapshots, and
since it smells like two slightly different symptoms of the same bug, I
decided to leave the test program at that.

-- 
Alexandre Oliva, freedom fighterhttp://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist  Red Hat Brazil Compiler Engineer
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] ceph 0.59 cephx problem

2013-03-22 Thread Joao Eduardo Luis

(Re-CC'ing the list)

On 03/22/2013 01:36 PM, Steffen Thorhauer wrote:

I was upgrading from 0.58 to ceph version 0.59 
(cbae6a435c62899f857775f66659de052fb0e759)
Upgrading from 0.57 to 0.58 was an easy one, so I was surprised by the problems.


v0.59 is the first dev release with a major monitor rework.  We've 
tested it thoroughly over the past weeks, but different usages tend to 
trigger different behaviours, so you might just have hit one of those 
buggers.



It seems to me that I made a fatal error that I don't understand.
I had 5 working mons (mon.{0-4}). After the upgrade of the first node I
lost the mon.4 with the cephx error. Then I upgraded all of the nodes and
I lost the mon.0 with the starting error.


The v0.59 monitors are unable to communicate with the <=0.58 monitors, so 
that's likely why the monitor appeared to be lost: you would need at 
least a majority of monitors on v0.59 so they could form a quorum.



After some restarts it looks like the other mons lost any quorum
so ceph -s or any kind of ceph commands didn't work anymore.


As long as you have a majority of monitors running v0.59, they ought to 
be able to form a quorum.  If they didn't, then something weird must 
have happened and logs would be much appreciated!



So I made today the decision to reinstall the test "cluster".


You decided to go back to v0.58, is that it?  Regardless, if you have 
logs that could provide some insight into what happened, we'd really 
appreciate it.


Thanks!

  -Joao



-Steffen

Btw. ceph rbd, adding/removing osds works great.


On Fri, Mar 22, 2013 at 10:01:10AM +, Joao Eduardo Luis wrote:
On 03/21/2013 03:47 PM, Steffen Thorhauer wrote:

I think I was impatient and should have waited for the v0.59 announcement. It
seems I should upgrade all monitors.
  After upgrading all nodes I have errors on 2 monitors like:
=== mon.0 ===
Starting Ceph mon.0 on u124-161-ceph...
mon fs missing 'monmap/latest' and 'mkfs/monmap'
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i 0 --pid-file
/var/run/ceph/mon.0.pid -c /etc/ceph/ceph.conf '

Steffen


Which version are you upgrading from?

Also, could you provide us with some logs of those monitors with 'debug
mon = 20' ?

   -Joao




On 03/21/2013 02:22 PM, Steffen Thorhauer wrote:

Hi,
I just upgraded one node of my ceph "cluster". I wanted to upgrade node
after node.
The osd on this node has no problem, but the mon (mon.4) has
authorization problems.
I didn't change any config, just did an apt-get upgrade.
ceph -s
   health HEALTH_WARN 1 mons down, quorum 0,1,2,3 0,1,2,3
   monmap e2: 5 mons at
{0=10.37.124.161:6789/0,1=10.37.124.162:6789/0,2=10.37.124.163:6789/0,3=10.37.124.164:6789/0,4=10.37.124.167:6789/0},
election epoch 162, quorum 0,1,2,3 0,1,2,3
   osdmap e4839: 16 osds: 16 up, 16 in
pgmap v195213: 3144 pgs: 3144 active+clean; 255 GB data, 820 GB
used, 778 GB / 1599 GB avail
   mdsmap e54723: 1/1/1 up {0=0=up:active}, 3 up:standby


but the mon.4 log file look like:

2013-03-21 12:45:15.701747 7f45412c6780  2 mon.4@-1(probing) e2 init
2013-03-21 12:45:15.702051 7f45412c6780 10 mon.4@-1(probing) e2 bootstrap
2013-03-21 12:45:15.702094 7f45412c6780 10 mon.4@-1(probing) e2
unregister_cluster_logger - not registered
2013-03-21 12:45:15.702121 7f45412c6780 10 mon.4@-1(probing) e2
cancel_probe_timeout (none scheduled)
2013-03-21 12:45:15.702147 7f45412c6780  0 mon.4@-1(probing) e2 my
rank is now 4 (was -1)
2013-03-21 12:45:15.702190 7f45412c6780 10 mon.4@4(probing) e2 reset_sync
2013-03-21 12:45:15.702213 7f45412c6780 10 mon.4@4(probing) e2 reset
2013-03-21 12:45:15.702238 7f45412c6780 10 mon.4@4(probing) e2
timecheck_finish
2013-03-21 12:45:15.702286 7f45412c6780 10 mon.4@4(probing) e2
cancel_probe_timeout (none scheduled)
2013-03-21 12:45:15.702312 7f45412c6780 10 mon.4@4(probing) e2
reset_probe_timeout 0x24d6580 after 2 seconds
2013-03-21 12:45:15.702387 7f45412c6780 10 mon.4@4(probing) e2 probing
other monitors
2013-03-21 12:45:15.703459 7f453a15f700 10 mon.4@4(probing) e2
ms_get_authorizer for mon
2013-03-21 12:45:15.703641 7f453a15f700 10 cephx: build_service_ticket
service mon secret_id 18446744073709551615 ticket_info.ticket.name=mon.
2013-03-21 12:45:15.703642 7f453a361700 10 mon.4@4(probing) e2
ms_get_authorizer for mon
2013-03-21 12:45:15.703694 7f453a361700 10 cephx: build_service_ticket
service mon secret_id 18446744073709551615 ticket_info.ticket.name=mon.
2013-03-21 12:45:15.703869 7f453a260700 10 mon.4@4(probing) e2
ms_get_authorizer for mon
2013-03-21 12:45:15.703957 7f453a260700 10 cephx: build_service_ticket
service mon secret_id 18446744073709551615 ticket_info.ticket.name=mon.
2013-03-21 12:45:15.704244 7f453a05e700 10 mon.4@4(probing) e2
ms_get_authorizer for mon
2013-03-21 12:45:15.704306 7f453a05e700 10 cephx: build_service_ticket
service mon secret_id 18446744073709551615 ticket_info.ticket.name=mon.
2013-03-21 12:45:15.704323 7f453a361700  0 cephx: verify_reply
coudln't decrypt with error: error decoding block for decryption
2013-0

docs

2013-03-22 Thread Jerker Nyberg


There seems to be a missing argument to ceph osd lost (also in help for the 
command).


http://ceph.com/docs/master/rados/operations/control/#osd-subsystem
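
(The intended form presumably takes the osd id, something like

  ceph osd lost 123 --yes-i-really-mean-it

but neither the help output nor the page says.)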

src/tools/ceph.cc
src/test/cli/ceph/help.t
doc/rados/operations/control.rst


The documentation for development release packages is slightly confused. 
Should it not refer to http://ceph.com/rpm-testing for development release 
packages? (Also, the ceph-release package in the development release does 
not refer to itself (in /etc/yum.repos.d/ceph.repo) but to 
(http://ceph.com/rpms) packages.)


http://ceph.com/docs/master/install/rpm/


Do you want patches?

--jerker
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corruption of active mmapped files in btrfs snapshots

2013-03-22 Thread Chris Mason
Quoting Alexandre Oliva (2013-03-22 01:27:42)
> On Mar 21, 2013, Chris Mason  wrote:
> 
> > Quoting Chris Mason (2013-03-21 14:06:14)
> >> With mmap the kernel can pick any given time to start writing out dirty
> >> pages.  The idea is that if the application makes more changes the page
> >> becomes dirty again and the kernel writes it again.
> 
> That's the theory.  But what if there's some race between the time the
> page is frozen for compressing and the time it's marked as clean, or
> it's marked as clean after it's further modified, or a subsequent write
> to the same page ends up overridden by the background compression of the
> old contents of the page?  These are all possibilities that come to mind
> without knowing much about btrfs inner workings.

Definitely, there is a lot of room for racing.  Are you using
compression in btrfs or just in leveldb?

> 
> >> So the question is, can you trigger this without snapshots being done
> >> at all?
> 
> I haven't tried, but I now have a program that hit the error condition
> while taking snapshots in background with small time perturbations to
> increase the likelihood of hitting a race condition at the exact time.
> It uses leveldb's infrastructure for the mmapping, but it shouldn't be
> too hard to adapt it so that it doesn't.
> 
> > So my test program creates an 8GB file in chunks of 1MB each.
> 
> That's probably too large a chunk to write at a time.  The bug is
> exercised with writes slightly smaller than a single page (although
> straddling across two consecutive pages).
> 
> This half-baked test program (hereby provided under the terms of the GNU
> GPLv3+) creates a btrfs subvolume and two files in it: one in which I/O
> will be performed with write()s, another that will get the same data
> appended with leveldb's mmap-based output interface.  Random block
> sizes, as well as milli and microsecond timing perturbations, are read
> from /dev/urandom, and the rest of the output buffer is filled with
> (char)1.
> 
> The test that actually failed (on the first try!, after some other
> variations that didn't fail) didn't have any of the #ifdef options
> enabled (i.e., no -D* flags during compilation), but it triggered the
> exact failure observed with ceph: zeros at the end of a page where there
> should have been nonzero data, followed by nonzero data on the following
> page!  That was within snapshots, not in the main subvol, but hopefully
> it's the same problem, just a bit harder to trigger.

I'd like to take snapshots out of the picture for a minute.  We need
some way to synchronize the leveldb with snapshotting because the
snapshot is basically the same thing as a crash from a db point of view.

Corrupting the main database file is a much different (and bigger)
problem.

-chris

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html