Re: new OSD re-using old OSD id fails to boot

2015-12-09 Thread David Zafman


On 12/9/15 2:39 AM, Wei-Chung Cheng wrote:

Hi Loic,

I tried to reproduce this problem on my CentOS 7 machine, but I could not hit the same issue.
This is my version:
ceph version 10.0.0-928-g8eb0ed1 (8eb0ed1dcda9ee6180a06ee6a4415b112090c534)
Could you describe the problem in more detail?


Hi David, Sage,

Most of the time, when we notice an OSD failure, the OSD is already in
the `out` state.
We cannot avoid the redundant data movement unless we set noout before
the OSD is marked out after the failure.
Is that right? (I mean that once an OSD goes into the `out` state, it
triggers some redundant data movement.)
Yes, one case would be that during the 5-minute down window after an OSD 
disk failure, the noout flag can be set if a spare disk is available.  
Another scenario would be a bad SMART status or noticing EIO errors from 
a disk, prompting a replacement.  So if a spare disk is already installed 
or you have hot-swappable drives, it would be nice to replace the drive 
and let recovery copy back all the data that should be there.  Using 
noout would be critical to this effort.
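
A rough sketch of that workflow with today's commands (the osd id is
illustrative, and the re-create step in the middle is the piece that does
not exist yet -- it is what tracker #13732 asks for):

   # prevent osd.5 from being marked out while the drive is swapped
   ceph osd set noout
   # stop osd.5, pull the failed drive, insert the spare
   # (missing piece) re-create the OSD on the new disk re-using id 5,
   # so recovery only has to refill this one disk
   # once osd.5 is back up and recovery has copied the data back:
   ceph osd unset noout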


I don't understand why Sage suggests below that a down+out phase would 
be required during the replacement.


Could we try the traditional spare behavior? (Keep some disks as spares
and automatically replace the broken device?)

That would let us replace the failed osd before it goes into the `out` state.
Or could we always set noout?

In fact, I think David and Loic are describing two different problems.
(The two problems are equally important :p)

If you have any problems, feel free to let me know.

thanks!!
vicente


2015-12-09 10:50 GMT+08:00 Sage Weil <sw...@redhat.com>:

On Tue, 8 Dec 2015, David Zafman wrote:

Remember I really think we want a disk replacement feature that would retain
the OSD id so that it avoids unnecessary data movement.  See tracker
http://tracker.ceph.com/issues/13732

Yeah, I totally agree.  We just need to form an opinion on how... probably
starting with the user experience.  Ideally we'd go from up + in to down +
in to down + out, then pull the drive and replace, and then initialize a

[Here]

new OSD with the same id... and journal partition.  Something like

   ceph-disk recreate id=N uuid=U 

I.e., it could use the uuid (which the cluster has in the OSDMap) to find
(and re-use) the journal device.
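
A sketch of how that might look from the operator's side (the "recreate"
subcommand is only a proposal here; the id and uuid values are
illustrative):

   # the OSDMap already records each OSD's uuid
   ceph osd dump | grep '^osd.7'     # the uuid is part of the osd line
   # proposed, not yet implemented: rebuild the replacement disk under
   # the same id and re-use the existing journal partition
   ceph-disk recreate id=7 uuid=<uuid from the OSDMap>   # hypothetical syntax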

For a journal failure it'd probably be different.. but maybe not?

Any other ideas?

sage


Re: new OSD re-using old OSD id fails to boot

2015-12-08 Thread David Zafman


Remember I really think we want a disk replacement feature that would 
retain the OSD id so that it avoids unnecessary data movement.  See 
tracker http://tracker.ceph.com/issues/13732


David

On 12/5/15 8:49 AM, Loic Dachary wrote:

Hi Sage,

The problem described at "new OSD re-using old OSD id fails to boot"
(http://tracker.ceph.com/issues/13988) consistently fails the ceph-disk suite on master. I
wonder if it could be a side effect of the recent optimizations introduced in the monitor?

Cheers





Re: Reply: [ceph-users] How long will the logs be kept?

2015-12-07 Thread David Zafman


dout() is used by an OSD to log information about what it is doing 
locally and can become very chatty.  It is saved on the local node's 
disk only.


clog is the cluster log and is used for major events that should be 
known by the administrator (see ceph -w).  Clog should be used sparingly 
as it sends the messages to the monitor.
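
A quick way to see the two streams on a running cluster (paths and the
osd id are the usual defaults; adjust for your deployment):

   # dout() output: a per-daemon file on that node's local disk
   tail -f /var/log/ceph/ceph-osd.0.log
   # make dout chattier for one OSD (the debug level is just an example)
   ceph tell osd.0 injectargs '--debug-osd 10/10'
   # clog output: cluster-wide events forwarded to the monitors
   ceph -w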


David

On 12/3/15 4:36 AM, Wukongming wrote:

OK! One more question. Do you know why ceph has two ways of outputting logs (dout and 
clog)? I find dout more helpful than clog. Did ceph use clog first, with dout 
added in a later version?

-
wukongming ID: 12019
Tel:0571-86760239
Dept:2014 UIS2 ONEStor

-----Original Message-----
From: Jan Schermer [mailto:j...@schermer.cz]
Sent: December 3, 2015 16:58
To: wukongming 12019 (RD)
Cc: huang jun; ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com
Subject: Re: [ceph-users] How long will the logs be kept?

You can set up logrotate however you want - not sure what the default is for 
your distro.
Usually logrotate doesn't touch files that are smaller than some size even if 
they are old. It will also not delete logs for OSDs that no longer exist.

Ceph itself has nothing to do with log rotation, logrotate does the work. Ceph 
packages likely contain default logrotate rules for the logs but you can edit 
them to your liking.
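
To check what your packages shipped and test it without waiting a week,
something like this works (the path is the usual package default):

   cat /etc/logrotate.d/ceph                 # show the installed rules
   logrotate -d /etc/logrotate.d/ceph        # dry run: print what would happen
   sudo logrotate -f /etc/logrotate.d/ceph   # force a rotation now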

Jan


On 03 Dec 2015, at 09:38, Wukongming  wrote:

Yes, I can find the ceph logrotate configuration file in the directory
/etc/logrotate.d.
I also noticed something weird.

drwxr-xr-x  2 root root   4.0K Dec  3 14:54 ./
drwxrwxr-x 19 root syslog 4.0K Dec  3 13:33 ../
-rw-------  1 root root      0 Dec  2 06:25 ceph.audit.log
-rw-------  1 root root    85K Nov 25 09:17 ceph.audit.log.1.gz
-rw-------  1 root root   228K Dec  3 16:00 ceph.log
-rw-------  1 root root    28K Dec  3 06:23 ceph.log.1.gz
-rw-------  1 root root   374K Dec  2 06:22 ceph.log.2.gz
-rw-r--r--  1 root root   4.3M Dec  3 16:01 ceph-mon.wkm01.log
-rw-r--r--  1 root root   561K Dec  3 06:25 ceph-mon.wkm01.log.1.gz
-rw-r--r--  1 root root   2.2M Dec  2 06:25 ceph-mon.wkm01.log.2.gz
-rw-r--r--  1 root root      0 Dec  2 06:25 ceph-osd.0.log
-rw-r--r--  1 root root    992 Dec  1 09:09 ceph-osd.0.log.1.gz
-rw-r--r--  1 root root    19K Dec  3 10:51 ceph-osd.2.log
-rw-r--r--  1 root root   2.3K Dec  2 10:50 ceph-osd.2.log.1.gz
-rw-r--r--  1 root root    27K Dec  1 10:31 ceph-osd.2.log.2.gz
-rw-r--r--  1 root root    13K Dec  3 10:23 ceph-osd.5.log
-rw-r--r--  1 root root   1.6K Dec  2 09:57 ceph-osd.5.log.1.gz
-rw-r--r--  1 root root    22K Dec  1 09:51 ceph-osd.5.log.2.gz
-rw-r--r--  1 root root    19K Dec  3 10:51 ceph-osd.8.log
-rw-r--r--  1 root root    18K Dec  2 10:50 ceph-osd.8.log.1
-rw-r--r--  1 root root   261K Dec  1 13:54 ceph-osd.8.log.2

I deployed the ceph cluster on Nov 21. The logs from that day through Dec 1, about
10 days' worth, were compressed into one file, which is not what I want.
Does any operation affect log compression?

Thanks!
Kongming Wu
-
wukongming ID: 12019
Tel:0571-86760239
Dept:2014 UIS2 ONEStor

-----Original Message-----
From: huang jun [mailto:hjwsm1...@gmail.com]
Sent: December 3, 2015 13:19
To: wukongming 12019 (RD)
Cc: ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com
Subject: Re: How long will the logs be kept?

It will rotate every week by default; you can see the logrotate file
/etc/logrotate.d/ceph

2015-12-03 12:37 GMT+08:00 Wukongming :

Hi ,All
Does anyone know how long, or how many days, the logs.gz files
(mon/osd/mds) are kept before they are removed?

-
wukongming ID: 12019
Tel:0571-86760239
Dept:2014 UIS2 OneStor




--
thanks
huangjun


Re: Error handling during recovery read

2015-12-04 Thread David Zafman



I can't remember the details now, but I know that recovery needed 
additional work.  If it were a simple fix, I would have done it when
implementing that code.

I found this bug related to recovery and ec errors 
(http://tracker.ceph.com/issues/13493)
BUG #13493: osd: for ec, cascading crash during recovery if one shard is 
corrupted


David

On 12/4/15 2:03 AM, Markus Blank-Burian wrote:

Hi David,

  


I am using ceph 9.2.0 with an erasure coded pool and have some problems with
missing objects.

  


Reads for degraded/backfilling objects on an EC pool, which detect an error
(-2 in my case) seem to be aborted immediately instead of reading from the
remaining shards. Why is there an explicit check for "!rop.for_recovery" in
ECBackend::handle_sub_read_reply? Would it be possible to remove this check
and let the recovery read be completed from the remaining good shards?

  


Markus

  







Re: OSD replacement feature

2015-11-23 Thread David Zafman


That is correct.  The goal is to only refill the replacement OSD disk.  
Otherwise, if the OSD is only down for less than 
mon_osd_down_out_interval (5 min default) or noout is set, no other data 
movement would occur.


David

On 11/23/15 8:45 PM, Wei-Chung Cheng wrote:

2015-11-21 1:54 GMT+08:00 David Zafman <dzaf...@redhat.com>:

There are two reasons for having a ceph-disk replace feature.

1. To simplify the steps required to replace a disk
2. To allow a disk to be replaced proactively without causing any data
movement.

Hi David,

It would be good to avoid causing any data movement when we want to replace a
failed osd.

But I don't have any idea how to accomplish that; could you give some opinions?

I thought that if we want to replace a failed osd we must move the object data from
the failed osd to the new (replacement) osd?

Or did I misunderstand something?

thanks!!!
vicente


So keeping the osd id the same is required and is what motivated the feature
for me.

David


On 11/20/15 3:38 AM, Sage Weil wrote:

On Fri, 20 Nov 2015, Wei-Chung Cheng wrote:

Hi Loic and cephers,

Sure, I have time to help (comment) on this feature to replace a disk.
This is a useful feature for handling disk failure :p

A simple procedure is described at http://tracker.ceph.com/issues/13732 :
1. set the noout flag - if the broken osd is a primary osd, can we handle
that well?
2. stop the osd daemon and wait for the osd to actually go down (or
maybe use the deactivate option of ceph-disk)

These two steps seem OK.
What about handling the crush map - should we remove the broken osd?
If we do that, why set the noout flag? It still triggers a re-balance
after we remove the osd from the crushmap.

Right--I think you generally want to do either one or the other:

1) mark osd out, leave failed disk in place.  or, replace with new disk
that re-uses the same osd id.

or,

2) remove osd from crush map.  replace with new disk (which gets new osd
id).

I think re-using the osd id is awkward currently, so doing 1 and replacing
the disk ends up moving data twice.

sage


Re: OSD replacement feature

2015-11-20 Thread David Zafman


There are two reasons for having a ceph-disk replace feature.

1. To simplify the steps required to replace a disk
2. To allow a disk to be replaced proactively without causing any data 
movement.


So keeping the osd id the same is required and is what motivated the 
feature for me.


David

On 11/20/15 3:38 AM, Sage Weil wrote:

On Fri, 20 Nov 2015, Wei-Chung Cheng wrote:

Hi Loic and cephers,

Sure, I have time to help (comment) on this feature to replace a disk.
This is a useful feature for handling disk failure :p

A simple procedure is described at http://tracker.ceph.com/issues/13732 :
1. set the noout flag - if the broken osd is a primary osd, can we handle that well?
2. stop the osd daemon and wait for the osd to actually go down (or
maybe use the deactivate option of ceph-disk)

These two steps seem OK.
What about handling the crush map - should we remove the broken osd?
If we do that, why set the noout flag? It still triggers a re-balance
after we remove the osd from the crushmap.

Right--I think you generally want to do either one or the other:

1) mark osd out, leave failed disk in place.  or, replace with new disk
that re-uses the same osd id.

or,

2) remove osd from crush map.  replace with new disk (which gets new osd
id).

I think re-using the osd id is awkward currently, so doing 1 and replacing
the disk ends up moving data twice.
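
For contrast, a sketch of option (2) as it works today (osd id and device
names are illustrative); because the id is removed and a new one is
allocated, data is rebalanced away from the old id and then again onto
the new one:

   ceph osd out 5
   # stop the osd.5 daemon (init-system specific), then remove it
   ceph osd crush remove osd.5
   ceph auth del osd.5
   ceph osd rm 5
   # prepare and activate the replacement disk; it registers under a new id
   ceph-disk prepare /dev/sdb
   ceph-disk activate /dev/sdb1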

sage


Re: pg scrub check problem

2015-10-28 Thread David Zafman


Initiating a manual deep-scrub like you are doing should always run.

The command you are running doesn't report any information; it just 
initiates a background process.  If you follow the command with ceph -w 
you'll see what is happening:


After I corrupted one of my replicas I see this.

$ ceph pg deep-scrub 1.6; ceph -w
instructing pg 1.6 on osd.3 to deep-scrub
cluster 8528c83b-0ff9-479c-af76-fc0ac5c595d3
 health HEALTH_OK
 monmap e1: 1 mons at {a=127.0.0.1:6789/0}
election epoch 2, quorum 0 a
 osdmap e14: 4 osds: 4 up, 4 in
flags sortbitwise
  pgmap v29: 16 pgs, 2 pools, 1130 bytes data, 1 objects
83917 MB used, 30311 MB / 117 GB avail
  16 active+clean

2015-10-28 12:23:17.724011 mon.0 [INF] from='client.? 
127.0.0.1:0/3672629479' entity='client.admin' cmd=[{"prefix": "pg 
deep-scrub", "pgid": "1.6"}]: dispatch
2015-10-28 12:23:19.787756 mon.0 [INF] pgmap v30: 16 pgs: 1 
active+clean+inconsistent, 15 active+clean; 1130 bytes data, 83917 MB 
used, 30310 MB / 117 GB avail

2015-10-28 12:23:18.274239 osd.3 [INF] 1.6 deep-scrub starts
2015-10-28 12:23:18.277332 osd.3 [ERR] 1.6 shard 2: soid 
1/7fc1f406/foo/head data_digest 0xe84d3cdc != known data_digest 
0x74d68469 from auth shard 0, size 7 != known size 1130
2015-10-28 12:23:18.277546 osd.3 [ERR] 1.6 deep-scrub 0 missing, 1 
inconsistent objects

2015-10-28 12:23:18.277549 osd.3 [ERR] 1.6 deep-scrub 1 errors
^C


David

On 10/28/15 3:34 AM, 池信泽 wrote:

Are you sure the osd actually began to scrub? You could check the osd
log, or use 'ceph pg dump' to check whether the scrub stamp changes or not.
There are some policies that can cause a scrub request to be rejected,
such as the system load, osd_scrub_min_interval,
osd_deep_scrub_interval, and so on.
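
A sketch of those checks (pgid and osd id are illustrative):

   # compare the deep-scrub timestamp before and after issuing the command
   ceph pg 1.0 query | grep deep_scrub_stamp
   # look at the scrub-related settings on the OSD that hosts the PG
   # (run on that OSD's node, via the admin socket)
   ceph daemon osd.0 config show | grep scrub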

2015-10-28 17:39 GMT+08:00 changtao381 :

Hi,

I’m testing the deep-scrub function of ceph.  And the test steps are below :

1)  I put an object on ceph using the command:
  rados put test.txt test.txt -p testpool

The size of testpool is 3, so there are three replicas on three osds:

osd.0:   /data1/ceph_data/osd.0/current/1.0_head/test.txt__head_8B0B6108__1
osd.1:   /data2/ceph_data/osd.1/current/1.0_head/test.txt__head_8B0B6108__1
osd.2:   /data3/ceph_data/osd.2/current/1.0_head/test.txt__head_8B0B6108__1

2) I modified the content of one replica on osd.0 using the vim editor directly on
disk

3) I ran the command
 ceph pg deep-scrub 1.0

and expected it to detect the inconsistency, but it fails: it doesn't
find the error.
Why?

Any suggestions will be appreciated! Thanks




Re: pg scrub check problem

2015-10-28 Thread David Zafman


Good point.  In my previous response I did "echo garbage > 
./foo__head_7FC1F406__1" to corrupt a replica.


David

On 10/28/15 5:13 PM, Sage Weil wrote:

Because you *just* wrote the object, and the FileStore caches open file
handles.  Vim renames a new inode over the old one so the open inode is
untouched.

If you restart the osd and then scrub you'll see the error.

sage
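
Putting the two together, a minimal reproduction sketch (the path and pg
id are from changtao381's earlier message; the restart command depends on
your init system):

   # corrupt the replica in place so the cached file handle sees the change
   echo garbage > /data1/ceph_data/osd.0/current/1.0_head/test.txt__head_8B0B6108__1
   # or, if the file was edited with vim, restart the OSD so FileStore
   # reopens the new inode
   sudo service ceph restart osd.0    # init-system specific
   ceph pg deep-scrub 1.0
   ceph -w                            # watch for the inconsistent report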




Re: wip-addr

2015-10-12 Thread David Zafman


I don't understand how encode/decode of entity_addr_t is changing 
without versioning in the encode/decode.  This means that this branch is 
changing the ceph-objectstore-tool export format if 
CEPH_FEATURE_MSG_ADDR2 is part of the features.  So we could bump 
super_header::super_ver if the export format must change.


Now that I look at it, I'm sure I can clear the watchers and 
old_watchers in object_info_t during export because that is dynamic 
information and it happens to include entity_addr_t.  I need to verify 
this, but that may be the only reason that the objectstore tool needs a 
valid features value to be passed there.


David

On 10/9/15 2:49 PM, Sage Weil wrote:

2.
>(about line 2067 in src/tools/ceph_objectstore_tool.cc)
>(use via ceph cmd?) tools - "object store tool".
>This has a way to serialize objects which includes a watch list
>which includes an address.  There should be an option here to say
>whether to include exported addresses.

I think it's safe to use defaults here.. what do you think, David?




Re: [ceph-users] O_DIRECT on deep-scrub read

2015-10-07 Thread David Zafman


There would be a benefit to doing fadvise POSIX_FADV_DONTNEED after 
deep-scrub reads for objects not recently accessed by clients.


I see the NewStore objectstore sometimes using the O_DIRECT  flag for 
writes.  This concerns me because the open(2) man page says:


"Applications should avoid mixing O_DIRECT and normal I/O to the same 
file, and especially to overlapping byte regions in the same file.  Even 
when the filesystem correctly handles the coherency issues in this 
situation, overall I/O throughput is likely to be slower than using 
either mode alone."


David

On 10/7/15 7:50 AM, Sage Weil wrote:

It's not, but it would not be hard to do this.  There are fadvise-style
hints being passed down that could trigger O_DIRECT reads in this case.
That may not be the best choice, though--it won't use data that happens
to be in cache and it'll also throw it out..

On Wed, 7 Oct 2015, Paweł Sadowski wrote:


Hi,

Can anyone tell if deep scrub is done using O_DIRECT flag or not? I'm
not able to verify that in source code.

If not would it be possible to add such feature (maybe config option) to
help keeping Linux page cache in better shape?

Thanks,

--
PS



Tracker 12577 repair won't fix replica with bad digest

2015-08-03 Thread David Zafman


Sage,

I restored the branch wip-digest-repair which merged post-hammer in pull 
request #4365.  Do you think that 4365 fixes the reported bug #12577?


I cherry-picked the 9 commits off of hammer-backports-next as pull 
request #5458 and assigned to Loic.


David




Re: ceph-objectstore-tool import failures

2015-07-07 Thread David Zafman


I'm going to skip exporting of temp objects in a new wip-temp-zafman 
branch.  Also, when we have persistent-temp objects, we'll probably 
need to enhance object_locator_to_pg() to adjust for negative pool numbers.


David

On 7/7/15 10:34 AM, Samuel Just wrote:

In the sense that the osd will still clear them, sure.  I've changed my mind 
though, probably best to not import or export them for now, and update the code 
to handle the persistent-temp objects when they exist (by looking at the hash). 
 We don't record anything about the in progress push, so the recovery temp 
objects at least aren't valuable to keep around.
-Sam

- Original Message -
From: Sage Weil sw...@redhat.com
To: Samuel Just sj...@redhat.com
Cc: David Zafman dzaf...@redhat.com, ceph-devel@vger.kernel.org
Sent: Tuesday, July 7, 2015 10:22:32 AM
Subject: Re: ceph-objectstore-tool import failures

On Tue, 7 Jul 2015, Samuel Just wrote:

If we think we'll want to persist some temp objects later on, probably
better to go ahead and export/import them now.

Replay isn't relevant here since it happens at a lower level.  The
ceph_objectstore_tool does do a kind of split during import since it
needs to be able to handle the case where the pg was split between the
import and the export.  In the event that temp objects need to persist
across intervals, we'll have to solve the problem of splitting the temp
objects in the osd as well as in the objectstore tool -- probably by
creating a class of persistent temp objects with non-fake hashes taken
from the corresponding non-temp object.

Yeah.. I suspect the right thing to do is make the temp object hash match
the eventual target hash.  We can do this now for the temp recovery
objects (even though they'll be deleted by the OSD).  Presumably the same
trick will work for recorded transaction objects too, or whatever
else...

In any case, for now the cot split can just look at hash like it does with
the non-temp objects and we're good, right?

sage



-Sam

- Original Message -
From: Sage Weil sw...@redhat.com
To: David Zafman dzaf...@redhat.com
Cc: sj...@redhat.com, ceph-devel@vger.kernel.org
Sent: Tuesday, July 7, 2015 10:00:09 AM
Subject: Re: ceph-objectstore-tool import failures

On Mon, 6 Jul 2015, David Zafman wrote:

Why import temp objects when clear_temp_objects() will just remove them on osd
start-up?

For now we could get away with skipping them, but I suspect in the future
there will be cases where we want to preserve them across restarts (for
example, when recording multi-object transactions that are not yet
committed).


If we need the temp objects for replay purposes, does it matter if a split has
occurred after the original export happened?

The replay should happen before the export... it's below the ObjectStore
interface, so I don't think it matters here.  I'm not sure about the split
implications, though.  Does the export/import have to do a split, or does
it let the OSD do that after it's imported?

sage


Or can we just import all temporary objects without regard to splits and
assume that after replay clear_temp_objects() will clean them up?

David


On 7/6/15 1:28 PM, Sage Weil wrote:

On Fri, 19 Jun 2015, David Zafman wrote:

This ghobject_t which has a pool of -3 is part of the export.   This
caused
the assert:

Read -3/1c/temp_recovering_1.1c_33'50_39_head/head

This was added by osd: use per-pool temp poolid for temp objects
18eb2a5fea9b0af74a171c3717d1c91766b15f0c in your branch.

You should skip it on export or recreate it on import with special
handling.

Ah, that makes sense.  I think we should include these temp objects in the
export, though, and make cot understand that they are part of the pool.
We moved the clear temp objects on startup logic into the OSD, which I
think will be useful for e.g. multiobject transactions (where we'll want
some objects that are internal/hidden to persist across peering intervals
and restarts).

Looking at your wip-temp-zafman, I think the first patch needs to be
dropped: include the temp objects, and I assume the meta one (which
has the pg log and other critical pg metadata).

Not sure where to change cot to handle the temp objects though?

Thanks!
sage





David

On 6/19/15 7:38 PM, David Zafman wrote:

Have not seen this as an assert before.  Given the code below in
do_import()
of master branch the assert is impossible (?).

if (!curmap.have_pg_pool(pgid.pgid.m_pool)) {
  cerr << "Pool " << pgid.pgid.m_pool << " no longer exists" << std::endl;
  // Special exit code for this error, used by test code
  return 10;  // Positive return means exit status
}


David

On 6/19/15 7:25 PM, Sage Weil wrote:

Hey David,

On this run

  /a/sage-2015-06-18_15:51:18-rados-wip-temp---basic-multi/939648

ceph-objectstore-tool is failing to import a pg because the pool
doesn't
exist.  It looks like the thrasher is doing an export+import and
racing
with a test that is tearing down a pool.  The crash is

ceph version

Re: ceph-objectstore-tool import failures

2015-07-06 Thread David Zafman


Why import temp objects when clear_temp_objects() will just remove them on 
osd start-up?


If we need the temp objects for replay purposes, does it matter if a 
split has occurred after the original export happened?


Or can we just import all temporary objects without regard to splits
and assume that after replay clear_temp_objects() will clean them up?

David


On 7/6/15 1:28 PM, Sage Weil wrote:

On Fri, 19 Jun 2015, David Zafman wrote:

This ghobject_t which has a pool of -3 is part of the export.   This caused
the assert:

Read -3/1c/temp_recovering_1.1c_33'50_39_head/head

This was added by osd: use per-pool temp poolid for temp objects
18eb2a5fea9b0af74a171c3717d1c91766b15f0c in your branch.

You should skip it on export or recreate it on import with special handling.

Ah, that makes sense.  I think we should include these temp objects in the
export, though, and make cot understand that they are part of the pool.
We moved the clear temp objects on startup logic into the OSD, which I
think will be useful for e.g. multiobject transactions (where we'll want
some objects that are internal/hidden to persist across peering intervals
and restarts).

Looking at your wip-temp-zafman, I think the first patch needs to be
dropped: include the temp objects, and I assume the meta one (which
has the pg log and other critical pg metadata).

Not sure where to change cot to handle the temp objects though?

Thanks!
sage





David

On 6/19/15 7:38 PM, David Zafman wrote:

Have not seen this as an assert before.  Given the code below in do_import()
of master branch the assert is impossible (?).

   if (!curmap.have_pg_pool(pgid.pgid.m_pool)) {
     cerr << "Pool " << pgid.pgid.m_pool << " no longer exists" << std::endl;
     // Special exit code for this error, used by test code
     return 10;  // Positive return means exit status
   }


David

On 6/19/15 7:25 PM, Sage Weil wrote:

Hey David,

On this run

 /a/sage-2015-06-18_15:51:18-rados-wip-temp---basic-multi/939648

ceph-objectstore-tool is failing to import a pg because the pool doesn't
exist.  It looks like the thrasher is doing an export+import and racing
with a test that is tearing down a pool.  The crash is

   ceph version 9.0.1-955-ge274efa
(e274efa450e99a68c02bcb713c8837d7809f1ec3)
   1: ceph-objectstore-tool() [0xa26335]
   2: (()+0xfcb0) [0x7f10cef18cb0]
   3: (gsignal()+0x35) [0x7f10cd5af425]
   4: (abort()+0x17b) [0x7f10cd5b2b8b]
   5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f10cdf0269d]
   6: (()+0xb5846) [0x7f10cdf00846]
   7: (()+0xb5873) [0x7f10cdf00873]
   8: (()+0xb596e) [0x7f10cdf0096e]
   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb0ce09]
   10: (ObjectStoreTool::get_object(ObjectStore*, coll_t,
ceph::buffer::list, OSDMap, bool*)+0x143f) [0x64829f]
   11: (ObjectStoreTool::do_import(ObjectStore*, OSDSuperblock, bool,
std::string)+0x13dd) [0x64a62d]
   12: (main()+0x3017) [0x632037]
   13: (__libc_start_main()+0xed) [0x7f10cd59a76d]
   14: ceph-objectstore-tool() [0x639119]

I don't think this is related to my branch.. but maybe?  Have you seen
this?  I rebased onto latest master yesterday.

sage



Re: deleting objects from a pool

2015-06-26 Thread David Zafman


This is a dangerous command because it can remove all your objects.  At 
least it can only do one namespace at a time.  It was intended to clean 
up rados bench runs, and is dangerous because it doesn't require the 
extra hoops that rados rmpool does.


I'm tempted to disallow usage this way, with empty --prefix/--run-name 
arguments.
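
Until such a restriction exists, the safer pattern is to always pass both
arguments explicitly (the pool, namespace and names below are just the
defaults a rados bench run would use):

   # remove only the objects written by that bench run, in that namespace
   rados -p testpool -N benchns cleanup --prefix benchmark_data \
       --run-name benchmark_last_metadata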


David

On 6/25/15 10:40 PM, Podoski, Igor wrote:

Hi David,

You're right; now I see that adding --run-name will clean all benchmark data from 
the specified namespace, so you only need to run the command once.

rados -p poolname -N namespace cleanup --prefix  --run-name 

Regards,
Igor.


-Original Message-
From: David Zafman [mailto:dzaf...@redhat.com]
Sent: Friday, June 26, 2015 3:46 AM
To: Podoski, Igor; Deneau, Tom; Dałek, Piotr; ceph-devel
Subject: Re: deleting objects from a pool


If you have rados bench data around, you'll need to run cleanup a second time because the 
first time the benchmark_last_metadata object will be consulted to find what 
objects to remove.

Also, using cleanup this way will only remove objects from the default 
namespace unless a namespace is specified with the -N option.

rados -p poolname -N namespace cleanup --prefix 

David

On 6/24/15 11:06 PM, Podoski, Igor wrote:

Hi,

It appears that cleanup can be used as a purge:

rados -p poolname cleanup  --prefix 

Regards,
Igor.


-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Deneau, Tom
Sent: Wednesday, June 24, 2015 10:22 PM
To: Dałek, Piotr; ceph-devel
Subject: RE: deleting objects from a pool

I've noticed that deleting objects from a basic k=2 m=1 erasure pool is much 
much slower than deleting a similar number of objects from a replicated size 3 
pool (so the same number of files to be deleted).   It looked like the ec pool 
object deletion was almost 20x slower.  Is there a lot more work to be done to 
delete an ec pool object?

-- Tom




-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of Dalek, Piotr
Sent: Wednesday, June 24, 2015 11:56 AM
To: ceph-devel
Subject: Re: deleting objects from a pool


-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of Deneau, Tom
Sent: Wednesday, June 24, 2015 6:44 PM

I have benchmarking situations where I want to leave a pool around
but delete a lot of objects from the pool.  Is there any really fast
way to do that?

I noticed rados rmpool is fast but I don't want to remove the pool.

I have been spawning multiple threads, each deleting a subset of the
objects (which I believe is what rados bench write does) but even that
can be very slow.

For now, apart from rados -p poolname cleanup (which doesn't
purge the pool, but merely removes objects written during last
benchmark run), the only option is by brute force:

for i in $(rados -p poolname ls); do (rados -p poolname rm $i > /dev/null); done;

There's no purge pool command in rados -- not yet, at least. I was
thinking about one, but never really had time to implement one.

With best regards / Pozdrawiam
Piotr Dałek


Re: deleting objects from a pool

2015-06-25 Thread David Zafman


If you have rados bench data around, you'll need to run cleanup a second 
time because the first time the benchmark_last_metadata object will be
consulted to find what objects to remove.

Also, using cleanup this way will only remove objects from the default 
namespace unless a namespace is specified with the -N option.


rados -p poolname -N namespace cleanup --prefix 

David

On 6/24/15 11:06 PM, Podoski, Igor wrote:

Hi,

It appears that cleanup can be used as a purge:

rados -p poolname cleanup  --prefix 

Regards,
Igor.


-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Deneau, Tom
Sent: Wednesday, June 24, 2015 10:22 PM
To: Dałek, Piotr; ceph-devel
Subject: RE: deleting objects from a pool

I've noticed that deleting objects from a basic k=2 m=1 erasure pool is much 
much slower than deleting a similar number of objects from a replicated size 3 
pool (so the same number of files to be deleted).   It looked like the ec pool 
object deletion was almost 20x slower.  Is there a lot more work to be done to 
delete an ec pool object?

-- Tom




-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of Dalek, Piotr
Sent: Wednesday, June 24, 2015 11:56 AM
To: ceph-devel
Subject: Re: deleting objects from a pool


-Original Message-
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
ow...@vger.kernel.org] On Behalf Of Deneau, Tom
Sent: Wednesday, June 24, 2015 6:44 PM

I have benchmarking situations where I want to leave a pool around
but delete a lot of objects from the pool.  Is there any really fast
way to do that?

I noticed rados rmpool is fast but I don't want to remove the pool.

I have been spawning multiple threads, each deleting a subset of the
objects (which I believe is what rados bench write does) but even that
can be very slow.

For now, apart from rados -p poolname cleanup (which doesn't purge
the pool, but merely removes objects written during last benchmark
run), the only option is by brute force:

for i in $(rados -p poolname ls); do (rados -p poolname rm $i > /dev/null); done;

There's no purge pool command in rados -- not yet, at least. I was
thinking about one, but never really had time to implement one.

With best regards / Pozdrawiam
Piotr Dałek


Re: ceph-objectstore-tool import failures

2015-06-19 Thread David Zafman


Have not seen this as an assert before.  Given the code below in 
do_import() of master branch the assert is impossible (?).


  if (!curmap.have_pg_pool(pgid.pgid.m_pool)) {
    cerr << "Pool " << pgid.pgid.m_pool << " no longer exists" << std::endl;
    // Special exit code for this error, used by test code
    return 10;  // Positive return means exit status
  }


David

On 6/19/15 7:25 PM, Sage Weil wrote:

Hey David,

On this run

/a/sage-2015-06-18_15:51:18-rados-wip-temp---basic-multi/939648

ceph-objectstore-tool is failing to import a pg because the pool doesn't
exist.  It looks like the thrasher is doing an export+import and racing
with a test that is tearing down a pool.  The crash is

  ceph version 9.0.1-955-ge274efa
(e274efa450e99a68c02bcb713c8837d7809f1ec3)
  1: ceph-objectstore-tool() [0xa26335]
  2: (()+0xfcb0) [0x7f10cef18cb0]
  3: (gsignal()+0x35) [0x7f10cd5af425]
  4: (abort()+0x17b) [0x7f10cd5b2b8b]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f10cdf0269d]
  6: (()+0xb5846) [0x7f10cdf00846]
  7: (()+0xb5873) [0x7f10cdf00873]
  8: (()+0xb596e) [0x7f10cdf0096e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb0ce09]
  10: (ObjectStoreTool::get_object(ObjectStore*, coll_t,
ceph::buffer::list, OSDMap, bool*)+0x143f) [0x64829f]
  11: (ObjectStoreTool::do_import(ObjectStore*, OSDSuperblock, bool,
std::string)+0x13dd) [0x64a62d]
  12: (main()+0x3017) [0x632037]
  13: (__libc_start_main()+0xed) [0x7f10cd59a76d]
  14: ceph-objectstore-tool() [0x639119]

I don't think this is related to my branch.. but maybe?  Have you seen
this?  I rebased onto latest master yesterday.

sage




Re: ceph-objectstore-tool import failures

2015-06-19 Thread David Zafman


This ghobject_t which has a pool of -3 is part of the export.   This 
caused the assert:


Read -3/1c/temp_recovering_1.1c_33'50_39_head/head

This was added by osd: use per-pool temp poolid for temp objects 
18eb2a5fea9b0af74a171c3717d1c91766b15f0c in your branch.


You should skip it on export or recreate it on import with special handling.

David

On 6/19/15 7:38 PM, David Zafman wrote:


Have not seen this as an assert before.  Given the code below in 
do_import() of master branch the assert is impossible (?).


  if (!curmap.have_pg_pool(pgid.pgid.m_pool)) {
    cerr << "Pool " << pgid.pgid.m_pool << " no longer exists" << std::endl;
    // Special exit code for this error, used by test code
    return 10;  // Positive return means exit status
  }


David

On 6/19/15 7:25 PM, Sage Weil wrote:

Hey David,

On this run

/a/sage-2015-06-18_15:51:18-rados-wip-temp---basic-multi/939648

ceph-objectstore-tool is failing to import a pg because the pool doesn't
exist.  It looks like the thrasher is doing an export+import and racing
with a test that is tearing down a pool.  The crash is

  ceph version 9.0.1-955-ge274efa
(e274efa450e99a68c02bcb713c8837d7809f1ec3)
  1: ceph-objectstore-tool() [0xa26335]
  2: (()+0xfcb0) [0x7f10cef18cb0]
  3: (gsignal()+0x35) [0x7f10cd5af425]
  4: (abort()+0x17b) [0x7f10cd5b2b8b]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f10cdf0269d]
  6: (()+0xb5846) [0x7f10cdf00846]
  7: (()+0xb5873) [0x7f10cdf00873]
  8: (()+0xb596e) [0x7f10cdf0096e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x259) [0xb0ce09]
  10: (ObjectStoreTool::get_object(ObjectStore*, coll_t,
ceph::buffer::list, OSDMap, bool*)+0x143f) [0x64829f]
  11: (ObjectStoreTool::do_import(ObjectStore*, OSDSuperblock, bool,
std::string)+0x13dd) [0x64a62d]
  12: (main()+0x3017) [0x632037]
  13: (__libc_start_main()+0xed) [0x7f10cd59a76d]
  14: ceph-objectstore-tool() [0x639119]

I don't think this is related to my branch.. but maybe?  Have you seen
this?  I rebased onto latest master yesterday.

sage




rsyslogd

2015-06-18 Thread David Zafman


Greg,

Have you changed anything (log rotation related?) that would uninstall 
or  cause rsyslog to not be able to start?


I'm sometimes seeing machines fail with this error probably in 
teuthology/nuke.py reset_syslog_dir().


CommandFailedError: Command failed on plana94 with status 1: 'sudo rm -f 
-- /etc/rsyslog.d/80-cephtest.conf && sudo service rsyslog restart'



David





Re: 'Racing read got wrong version' during proxy write testing

2015-06-03 Thread David Zafman


I wonder if this issue could be the cause of #11511.  Could a proxy 
write have raced with fill_in_copy_get() so that the object_info_t size 
doesn't correspond with the size of the object in the filestore?


David


On 6/3/15 6:22 PM, Wang, Zhiqiang wrote:

Making the 'copy get' op to be a cache op seems like a good idea.

-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Thursday, June 4, 2015 9:14 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: RE: 'Racing read got wrong version' during proxy write testing

On Wed, 3 Jun 2015, Wang, Zhiqiang wrote:

I ran into the 'op not idempotent' problem during the testing today.
There is one bug in the previous fix. In that fix, we copy the reqids
in the final step of 'fill_in_copy_get'. If the object is deleted,
since the 'copy get' op is a read op, it returns earlier with ENOENT in do_op.
No reqids will be copied during promotion in this case. This again
leads to the 'op not idempotent' problem. We need a 'smart' way to
detect the op is a 'copy get' op (looping the ops vector doesn't seem
smart?) and copy the reqids in this case.

Hmm.  I think the idea here is/was that that ENOENT would somehow include the 
reqid list from PGLog::get_object_reqids().

I think the trick is getting it past the generic check in do_op:

   if (!op->may_write() &&
       !op->may_cache() &&
       (!obc->obs.exists ||
        ((m->get_snapid() != CEPH_SNAPDIR) &&
         obc->obs.oi.is_whiteout()))) {
     reply_ctx(ctx, -ENOENT);
     return;
   }

Maybe we mark these as cache operations so that may_cache is true?

Sam, what do you think?

sage



-Original Message-
From: Sage Weil [mailto:sw...@redhat.com]
Sent: Tuesday, May 26, 2015 12:27 AM
To: Wang, Zhiqiang
Cc: ceph-devel@vger.kernel.org
Subject: Re: 'Racing read got wrong version' during proxy write
testing

On Mon, 25 May 2015, Wang, Zhiqiang wrote:

Hi all,

I ran into a problem during the teuthology test of proxy write. It is like this:

- Client sends 3 writes and a read on the same object to base tier
- Set up cache tiering
- Client retries ops and sends the 3 writes and 1 read to the cache
tier
- The 3 writes finished on the base tier, say with versions v1, v2
and
v3
- Cache tier proxies the 1st write, and start to promote the object
for the 2nd write, the 2nd and 3rd writes and the read are blocked
- The proxied 1st write finishes on the base tier with version v4,
and returns to cache tier. But somehow the cache tier fails to send
the reply due to socket failure injecting
- Client retries the writes and the read again, the writes are
identified as dup ops
- The promotion finishes, it copies the pg_log entries from the base
tier and put it in the cache tier's pg_log. This includes the 3
writes on the base tier and the proxied write
- The writes dispatches after the promotion, they are identified as
completed dup ops. Cache tier replies these write ops with the
version from the base tier (v1, v2 and v3)
- In the last, the read dispatches, it reads the version of the
proxied write (v4) and replies to client
- Client complains that 'racing read got wrong version'

In a previous discussion of the 'ops not idempotent' problem, we solved it by 
copying the pg_log entries in the base tier to cache tier during promotion. 
Seems like there is still a problem with this approach in the above scenario. 
My first thought is that when proxying the write, the cache tier should use the 
original reqid from the client. But currently we don't have a way to pass the 
original reqid from cache to base. Any ideas?

I agree--I think the correct fix here is to make the proxied op be recognized 
as a dup.  We can either do that by passing in an optional reqid to the 
Objecter, or extending the op somehow so that both reqids are listed.  I think 
the first option will be cleaner, but I think we will also need to make sure 
the 'retry' count is preserved as (I think) we skip the dup check if retry==0.  
And we probably want to preserve the behavior that a given (reqid, retry) only 
exists once in the system.

This probably means adding more optional args to Objecter::read()...?

sage


Re: should we prepare to release firefly v0.80.10 ?

2015-04-21 Thread David Zafman


In early March I ran rados:thrash on the firefly backport of the 
ceph-objectstore-tool changes (wip-cot-firefly).  We considered it 
passed, even though an obscure segfault was seen:


bug #11141: Segmentation Violation: ceph-objectstore-tool doing --op 
list-pgs


David


On 4/21/15 8:52 AM, Sage Weil wrote:

The bulk of it is ceph-objectstore-tool, which is important to get into a
release, IMO.  David, are these being tested in the firefly thrashing
tests yet?

The only other one I'm worried about is

6fd3dfa osd: do not ignore deleted pgs on startup

Sam, I assume the recent hammer upgrade issue would bite firefly folks
who upgrade too?

sage


On Tue, 21 Apr 2015, Loic Dachary wrote:


Hi Sage,

The firefly branch has a number of fixes ( 
http://tracker.ceph.com/issues/11090#Release-information ) and has been used 
for upgrade tests in the past few weeks. A few other issues have been 
backported since and are being tested in the integration branch ( 
http://tracker.ceph.com/issues/11090#teuthology-run-commitb91bbb434e6363a99a632cf3841f70f1f2549f79-integration-branch-april-2015
 ).

Do you think these changes deserve a firefly v0.80.10 release ? Should we ask 
each lead for their approval ? Or is it better to keep backporting what needs 
to be and wait a few weeks ?

Cheers

--
Loïc Dachary, Artisan Logiciel Libre





Re: regenerating man pages

2015-03-17 Thread David Zafman


I found that I could not build the docs on Ubuntu 14.10 with the proper 
packages installed.  Kefu is looking into Asphyxiate, which is very 
temperamental.  I installed an Ubuntu 11.10 VM in order to generate the docs.


David

On 3/17/15 10:11 AM, Sage Weil wrote:

On Tue, 17 Mar 2015, Josh Durgin wrote:

On 03/17/2015 09:40 AM, Ken Dreyer wrote:

I had a question about the way that we're handling man pages.

In 356a749f63181d401d16371446bb8dc4f196c2a6 , rbd: regenerate rbd(8)
man page, it looks like man/rbd.8 was regenerated from doc/man/8/rbd.rst

It seems like it would be more efficient to avoid storing man pages in
Git and generate them dynamically at build time instead?

Yes, that'd be great!


https://github.com/ceph/ceph/blob/master/admin/manpage-howto.txt

admin/build-doc does a lot of things (including man page generation).
Could we simply run the sphinx-build -b man part at build time as a
part of make?

I don't see a reason not to. It's just a matter of making it work on all
the platforms we're building packages for. That might be annoying for
the entirety of build-doc, but for just building man pages it should
be simple.

I think the original reason we didn't was just because there are a lot of
dependencies for building the docs, so this inflates Build-Depends.  That
doesn't particularly bother me, though, if the deps do in fact exist.

sage


Hammer incompat bits and ceph-objectstore-tool

2015-03-17 Thread David Zafman


During upgrade testing an error occurred because ceph-objectstore-tool, 
while importing on a Firefly node, found the compat_features from an 
export made on Hammer.


There are 2 new feature bits set as shown in the error message:

Export has incompatible features set 
compat={},rocompat={},incompat={12=transaction hints,13=pg meta object}


In this case, as far as I can tell, these osd-incompatible changes 
wouldn't make the export data incompatible in any way.  So we may have 
to check compatibility bits on a case-by-case basis if we want to allow 
the tool to work in as many cases as possible.
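
For reference, this is roughly the kind of invocation the upgrade test
drives (data paths and pgid are illustrative; the OSDs are stopped while
the tool runs):

   # on the Hammer node: export one PG
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2 \
       --journal-path /var/lib/ceph/osd/ceph-2/journal \
       --op export --pgid 1.7 --file /tmp/1.7.export
   # on the Firefly node: the import that trips over the new feature bits
   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
       --journal-path /var/lib/ceph/osd/ceph-5/journal \
       --op import --file /tmp/1.7.export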


During upgrade testing it is interesting that one node has the 
transaction hints feature, but other nodes still running firefly don't.  
Is this a case where we don't have to wait for all  OSDs to update 
before the cluster can start handling OP_COLL_HINT operations?


David Zafman



Building documentation

2015-03-09 Thread David Zafman


I was having trouble building man pages on my Ubuntu 14.04 build 
machine, so I looked at gitbuilder-doc.  I saw that it was running 
Ubuntu 11.10.  Even though the end-of-life for Ubuntu 11.10 was May 9, 
2013, I installed a new virtual machine with it.  I needed to change 
/etc/apt/sources.list to use old-releases.ubuntu.com in order to install 
additional packages.  Just as on gitbuilder-doc, the admin/build-doc 
command runs without errors.


I assume other distributions with more up to date packages will see the 
same problem.  I filed bug #11077 with the sphinx log attached.


David Zafman


Clocks out of sync

2015-02-20 Thread David Zafman


On 2 of my rados thrash runs the clocks were out of sync.  Is this an 
occasional issue or did we have an infrastructure problem?


On burnupi19 and burnupi25:
2015-02-20 12:52:52.636017 mon.1 10.214.134.14:6789/0 177 : cluster 
[WRN] message from mon.0 was stamped 0.501458s in the future, clocks not 
synchronized


On plana62 and plana64:
2015-02-20 10:00:56.842533 mon.0 10.214.132.14:6789/0 3 : cluster [WRN] 
message from mon.1 was stamped 0.855106s in the future, clocks not 
synchronized
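
When this shows up, a quick check on the affected nodes is something like
(mon name is illustrative; ntpq comes from the ntp package):

   ntpq -p     # is ntpd syncing, and against which peers?
   # how much drift the monitors tolerate before warning (on the mon node)
   ceph daemon mon.a config show | grep mon_clock_drift_allowed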





Disk failing plana74

2015-02-20 Thread David Zafman


A recent test run had an EIO on the following disk:

plana74 /dev/sdb

The machine is locked right now.
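
For whoever picks this up, the usual quick checks on the node (requires
smartmontools; device name as above):

   sudo smartctl -H /dev/sdb     # overall health verdict
   sudo smartctl -a /dev/sdb     # full attributes, reallocated/pending sectors
   dmesg | grep -i sdb           # look for the I/O errors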

David Zafman
Senior Developer


sage-2015-02-15_07:44:23-rados-hammer-distro-basic-multi failures

2015-02-16 Thread David Zafman


There were 24 failures before the run was killed.

758289 757223
FAILED assert(weak_refs.empty()), saw valgrind issues
Filed bug #10901

757405
failed to become clean before timeout expired
osd.4 killed and never restarted, Thrasher may have died
Filed bug #10902

758034:
mira038 disk I/O error ceph-1 on /dev/sdf
Sandon is aware

757087 757162 758292 758300 758071
already fixed bug #10784 Watch timeout

757177 757385 757506
No JSON object could be decoded bug #10630

757185
osd/ReplicatedPG.cc: 12991: FAILED assert(obc) bug #10820 (testing)

757431 757365
infrastructure - could not read lock status

757601 75 757952
infrastructure - too many values to unpack (immediately after locking 
machines)


757895
FAILED assert(0 == "racing read got wrong version"): already fixed bug 
#10830


758070 758244 757075 757251 757426
infrastructure?  Immediate osd crash: ERROR: osd init failed: (1) 
Operation not permitted


David Zafman




LTTNG

2015-02-03 Thread David Zafman


On Ubuntu 12.04.1 LTS after doing an install-deps.sh and the new 
do_autogen.sh without -L, I get a config error with this in the config.log:


configure:22637: checking if lttng-gen-tp is sane
configure:22647: result: no
configure:22681: checking lttng/tracepoint.h usability
configure:22681: gcc -c  -g -Wextra -Wno-missing-field-initializers 
-Wno-missing-declarations -Wno-unused-parameter  conftest.c >&5

configure:22681: $? = 0
configure:22681: result: yes
configure:22681: checking lttng/tracepoint.h presence
configure:22681: gcc -E  conftest.c
configure:22681: $? = 0
configure:22681: result: yes
configure:22681: checking for lttng/tracepoint.h
configure:22681: result: yes
configure:22692: checking for lttng-gen-tp
configure:22708: found /usr/bin/lttng-gen-tp
configure:22719: result: yes
configure:22737: error: in `/home/dzafman/ceph2':
configure:22739: error: lttng-gen-tp does not behave properly

David Zafman
Senior Developer
http://www.redhat.com


Re: 'Immutable bit' on pools to prevent deletion

2015-01-17 Thread David Zafman


The most secure way would be one in which you can only create pools with 
WORM set and can't ever change the WORM state of a pool.  I like this 
simple/secure approach as a first cut.


David

On 1/17/15 11:09 AM, Alex Elsayed wrote:

Sage Weil wrote:


On Fri, 16 Jan 2015, Alex Elsayed wrote:

Wido den Hollander wrote:

snip

Is it a sane thing to look at 'features' which pools could have? Other
features which might be set on a pool:

- Read Only (all write operations return -EPERM)
- Delete Protected

There's another pool feature I'd find very useful: a WORM flag that
permits only create & append (at the RADOS level, not the RBD level as
was an Emperor blueprint).

In particular, I'd _love_ being able to make something that takes
Postgres WAL logs and puts them in such a pool, providing real guarantees
re: consistency. Similarly, audit logs and such for compliance.

How would you want this to work?

- If the bit is set, object creates are allowed, but not deletes?  What
about append?

- Are you allowed to clear the bit with something like 'ceph osd pool set
pool worm false' ?

I'd say that a WORM pool would allow 'create' and 'append' only - that fits
well with the classic notions of WORM media, and would allow natural
implementations of virtualized WORM tape libraries and such for people who
need compatibility (if only object creation was supported, you get issues
where either you have absurd tiny objects or risk data loss on failure to
write a larger buffered chunk). Similarly, audit records (like selinux logs,
that might be written by log daemons) don't really come in nice object-sized
chunks. You want to always have the newest in the archive, so you really
don't want to buffer up to that.

I'd also figure that WORM's main purpose is assurance/compliance - you want
to _know_ that nobody could have turned the bit off, futzed with the data,
and then turned it back on. Otherwise, you'd just write your clients to only
use create/append, and have no need for WORM at the pool level. Because of
that, if the flag can be cleared via commands, it should be possible for the
admins to forbid it (by flat denying it in config, via some keying system,
via the bits being a ratchet, whatever - I'm not especially concerned by how
the guarantee is provided, so long as it can be).

Setting it should probably also be privileged, since it'd be trivial to
cause a DOS by setting it on (say) a CephFS pool - although handling that
concern is likely out-of-scope for now, since there are easier ways to ruin
someone's day at the RADOS level.






Some gitbuilders not working

2015-01-08 Thread David Zafman


We are seeing gitbuilder failures.  This is what I saw on one.

error: Failed build dependencies:
xmlstarlet is needed by ceph-1:0.90-821.g680fe3c.el7.x86_64

David Zafman
Senior Developer
http://www.redhat.com




ceph-objectstore-tool and make check

2014-12-19 Thread David Zafman


The objectstore tool has been renamed from ceph_objectstore_tool to 
ceph-objectstore-tool.


Please remove src/.libs/ceph_objectstore_tool and 
src/.libs/lt-ceph_objectstore_tool or do a make clean with latest 
master branch.  Otherwise, a local make check can fail because the old 
binary of the tool will always be executed.
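
In shell terms, either of these should do it (paths as given above):

rm -f src/.libs/ceph_objectstore_tool src/.libs/lt-ceph_objectstore_tool
# or, from the top of the source tree:
make clean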



David Zafman
Senior Developer
http://www.redhat.com


Re: Pull requests : speed up the reviews

2014-11-09 Thread David Zafman

I know I had a couple of pull requests that we weren’t going to merge until 
after the giant release.  This may have applied to some of the other ones too.  In 
addition, it isn’t surprising that with a new release some non-release code 
reviews would be neglected.

That being said, this is a good time to remind people to dedicate time to code 
reviews.

David Zafman
Senior Developer
http://www.inktank.com




 On Nov 9, 2014, at 4:08 AM, Joao Eduardo Luis j...@redhat.com wrote:
 
 On 11/08/2014 05:32 PM, Loic Dachary wrote:
 Hi Ceph,
 
 In the past few weeks the number of pending pull requests grew from around 
 20 to over 80. The good thing is that there are more contributions, the 
 problem is that it requires more reviewers. Ceph is not the only project 
 suffering from this kind of problem and attending the OpenStack summit last 
 week reminded me that the sooner it is addressed the better.
 
 After a few IRC discussions some ideas came up and my favorite is that every 
 developer paid full time to work on Ceph dedicates a daily 15 minutes time 
 slot, time boxed, to review pull requests. Timeboxing is kind of frustrating 
 because some reviews require more. It basically means one has to focus on 
 the pull request for ten minutes at most and take five minutes to write a 
 useful comment that helps the author moving forward. But it also is the only 
 way to make room for a daily activity with no risk of postponing it because 
 something more urgent came up.
 
 What do you think ?
 
 On my calendar, I do have a time slot of one hour each morning to review pull 
 requests and mailing lists but I seldom honor it, especially when I'm caught 
 up in other stuff.
 
 I'll move it over to lunch so that it has no chance in interfering with other 
 tasks and try to make a habit of it.
 
 It would also be interesting to see more community involvement.  I believe it 
 would be healthy for the project if we could have (at least) a portion of 
 reviews being performed by other people besides solely the paid 
 developers/maintainers.
 
  -Joao
 
 -- 
 Joao Eduardo Luis
 Software Engineer | http://ceph.com



Re: Can pid be reused ?

2014-10-22 Thread David Zafman

I just realized what it is.  The way killall is used when stopping a vstart 
cluster is to kill all processes by name!  You can't stop vstarted tests 
running in parallel.
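
Roughly, the contrast is the following; ./init-ceph stop is the pid-file-based
path Sage suggests in his reply:

killall ceph-mon ceph-osd ceph-mds   # by name: takes out every vstart cluster on the box
./init-ceph stop                     # via pid files: stops only this cluster's daemons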

David Zafman
Senior Developer
http://www.inktank.com




 On Oct 21, 2014, at 7:55 PM, Loic Dachary l...@dachary.org wrote:
 
 Hi,
 
 Something strange happens on fedora20 with linux 3.11.10-301.fc20.x86_64. 
 Running make -j8 check on https://github.com/ceph/ceph/pull/2750 a process 
 gets killed from time to time. For instance it shows as
 
 TEST_erasure_crush_stripe_width: 124: stripe_width=4096
 TEST_erasure_crush_stripe_width: 125: ./ceph osd pool create pool_erasure 12 
 12 erasure
 *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
 ./test/mon/osd-pool-create.sh: line 120: 27557 Killed  ./ceph 
 osd pool create pool_erasure 12 12 erasure
 TEST_erasure_crush_stripe_width: 126: ./ceph --format json osd dump
 TEST_erasure_crush_stripe_width: 126: tee osd-pool-create/osd.json
 
 in the test logs. Note the 27557 Killed . I originally thought it was because 
 some ulimit was crossed and set them to very generous / unlimited hard / soft 
 thresholds.
 
 core file size  (blocks, -c) 0
  
 data seg size   (kbytes, -d) unlimited
  
 scheduling priority (-e) 0
  
 file size   (blocks, -f) unlimited
  
 pending signals (-i) 515069   
  
 max locked memory   (kbytes, -l) unlimited
  
 max memory size (kbytes, -m) unlimited
  
 open files  (-n) 40   
  
 pipe size(512 bytes, -p) 8
  
 POSIX message queues (bytes, -q) 819200   
  
 real-time priority  (-r) 0
  
 stack size  (kbytes, -s) unlimited
  
 cpu time   (seconds, -t) unlimited
  
 max user processes  (-u) unlimited
  
 virtual memory  (kbytes, -v) unlimited
  
 file locks  (-x) unlimited
 
 Benoit Canet suggested that I installed systemtap ( 
 https://www.sourceware.org/systemtap/wiki/SystemtapOnFedora ) and ran 
 https://sourceware.org/systemtap/examples/process/sigkill.stp to watch what 
 was sending the kill signal. It showed the following:
 
 ...
 SIGKILL was sent to ceph-osd (pid:27557) by vstart_wrapper. uid:1001
 SIGKILL was sent to python (pid:27557) by vstart_wrapper. uid:1001
 
 
 which suggests that pid 27557 used by ceph-osd was reused for the python 
 script that was killed above. Because the script that kills daemons is very 
 agressive and kill -9 the pid to check if it really is dead
 
 https://github.com/ceph/ceph/blob/giant/src/test/mon/mon-test-helpers.sh#L64
 
 it explains the problem.
 
 However, as Dan Mick suggests, reusing pid quickly could break a number of 
 things and it is a surprising behavior. Maybe something else is going on. A 
 loop creating processes sees their pid increasing and not being reused.
 
 Any idea about what is going on would be much appreciated :-)
 
 Cheers
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 
 



Re: Can pid be reused ?

2014-10-22 Thread David Zafman

 On Oct 22, 2014, at 3:43 PM, Sage Weil s...@newdream.net wrote:
 
 On Wed, 22 Oct 2014, David Zafman wrote:
 I just realized what it is.  The way killall is used when stopping a 
 vstart cluster, is to kill all processes by name!  You can't stop 
 vstarted tests running in parallel.
 
 Ah.  FWIW I think we should avoid using stop.sh whenever possible and 
 instead do ./init-ceph stop (which does an orderly shutdown via pid 
 files).
 
 sage

Actually, vstart.sh can’t create 2 independent clusters anyway, so it kills any 
existing processes.  Probably vstart.sh is what would have killed the processes 
in a parallel make check.

David


Re: vstart.sh crashes MON with --paxos-propose-interval=0.01 and one MDS

2014-10-16 Thread David Zafman

I have this change in my branch so that test/ceph_objectstore_tool.py works 
again after that change from John.  I wonder if this would fix your case too:

commit 18937cf49be616d32b4e2d0b6deef2882321fbe4
Author: David Zafman dzaf...@redhat.com
Date:   Tue Oct 14 18:45:41 2014 -0700

vstart.sh: Disable mon pg warn min per osd to get healthy

Signed-off-by: David Zafman dzaf...@redhat.com

diff --git a/src/vstart.sh b/src/vstart.sh
index febfa56..7a0ec1c 100755
--- a/src/vstart.sh
+++ b/src/vstart.sh
@@ -394,7 +394,7 @@ $COSDDEBUG
 $COSDMEMSTORE
 $extra_conf
 [mon]
-mon pg warn min per osd = 10
+mon pg warn min per osd = 0
 mon osd allow primary affinity = true
 mon reweight min pgs per osd = 4
 $DAEMONOPTS
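
To try it against another branch, the commit can be picked directly (hash from
the commit header above; resolve any vstart.sh conflicts by hand):

git cherry-pick 18937cf49be616d32b4e2d0b6deef2882321fbe4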

David Zafman
Senior Developer
http://www.inktank.com




On Oct 16, 2014, at 3:52 PM, Loic Dachary l...@dachary.org wrote:

 Hi John,
 
 I would be gratefull if you could take a quick look at 
 http://tracker.ceph.com/issues/9794 . It is bisected to the reduction of pg 
 and I'm able to reproduce it in a ubuntu-14.04 docker fresh install. For some 
 reason it does not happen in gitbuilder but I think you can reproduce it 
 locally now.
 
 Cheers
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 



Re: make check failures

2014-10-08 Thread David Zafman

After updating my master branch, “make check” passes now.

David Zafman
Senior Developer
http://www.inktank.com




On Oct 7, 2014, at 11:28 PM, Loic Dachary l...@dachary.org wrote:

 [cc'ing the list in case someone else experiences problems with make check]
 
 Hi David,
 
 Yesterday you mentioned that make check is failing for you on master. Would 
 you be so kind as to send the logs ?
 
 Cheers
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 



wip-libcommon-rebase

2014-08-29 Thread David Zafman

Adam  Sage,

The commit “osd: make coll_t::META static to each file” from wip-libcommon has 
been merged to master.  I created a new branch with the other commits on the 
latest master branch called wip-libcommon-rebase.**  It required some conflict 
resolution in ceph.spec.in.  

**No warranties are expressed or implied about the correctness or suitability 
of this branch for future use.

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com



Testing intermediate code for improved namespace handling

2014-08-28 Thread David Zafman

Check default namespace (none specified)
This keeps the command output compatible with existing scripts
./rados -p test ls
default-obj8
default-obj10
default-obj6
default-obj7
default-obj1
default-obj2
default-obj3
default-obj4
default-obj5
default-obj9

Try for all namespaces
./rados -p test -N * ls
ns2 ns2-obj3
ns2 ns2-obj5
ns2 ns2-obj10
   default-obj8
ns2 ns2-obj4
ns2 ns2-obj2
ns2 ns2-obj8
   default-obj10
ns1 ns1-obj5
   default-obj6
   default-obj7
ns1 ns1-obj4
ns1 ns1-obj10
ns1 ns1-obj2
   default-obj1
   default-obj2
ns1 ns1-obj9
ns1 ns1-obj3
   default-obj3
ns1 ns1-obj6
ns1 ns1-obj1
ns2 ns2-obj7
ns2 ns2-obj9
ns1 ns1-obj8
   default-obj4
ns1 ns1-obj7
   default-obj5
ns2 ns2-obj6
ns2 ns2-obj1
   default-obj9

Try for only one specific namespace
./rados -p test -N ns1 ls
ns1-obj5
ns1-obj4
ns1-obj10
ns1-obj2
ns1-obj9
ns1-obj3
ns1-obj6
ns1-obj1
ns1-obj8
ns1-obj7
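
For reference, a minimal sketch of how a test population like the listings
above could be created, assuming the same -N option is honored by object
creation in this intermediate code:

for i in $(seq 1 10); do
    ./rados -p test create default-obj$i        # default namespace
    ./rados -p test -N ns1 create ns1-obj$i     # namespace ns1
    ./rados -p test -N ns2 create ns2-obj$i     # namespace ns2
done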

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com





Building a tool which links with librados

2014-08-21 Thread David Zafman

Has anyone seen anything like this from an application linked with librados 
using valgrind?  Or a Segmentation fault on exit from such an application?

Invalid free() / delete / delete[] / realloc()
at 0x4C2A4BC: operator delete(void*) (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
by 0x8195C12: std::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::~basic_string() (in 
/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16)
by 0x13890F3: coll_t::~coll_t() (osd_types.h:468)
by 0x8944DEC: __cxa_finalize (cxa_finalize.c:56)
by 0x6E1CEC5: ??? (in /src/ceph/src/.libs/librados.so.2.0.0)
by 0x725F400: ??? (in /src/ceph/src/.libs/librados.so.2.0.0)
by 0x89449D0: __run_exit_handlers (exit.c:78)
by 0x8944A54: exit (exit.c:100)
by 0x137FF37: usage(boost::program_options::options_description) 
(ceph_objectstore_tool.cc:1794)
by 0x1380572: main (ceph_objectstore_tool.cc:1849)
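
The stack shows the process exiting through usage(), so running the tool from
the build tree under memcheck with missing arguments should walk the same path:

valgrind --tool=memcheck ./ceph_objectstore_tool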

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com



Re: Building a tool which links with librados

2014-08-21 Thread David Zafman

The import-rados feature (#8276) uses librados so in my wip-8231 branch I now 
link with librados.   It is hard to reproduce, but I’ll play with that commit 
and branch.
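
Concretely, playing with it will look something like this (commit hash and
branch name from Sage's reply quoted below; fetch the branch first if it is not
already local):

git cherry-pick 5d79605319fcde330bccce5e1b07276a98be02de
# or take the whole branch:
git merge wip-libcommon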

David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com

On Aug 21, 2014, at 4:56 PM, Sage Weil sw...@redhat.com wrote:

 On Thu, 21 Aug 2014, Gregory Farnum wrote:
 On Thu, Aug 21, 2014 at 4:37 PM, David Zafman david.zaf...@inktank.com 
 wrote:
 
 Has anyone seen anything like this from an application linked with librados 
 using valgrind?  Or a Segmentation fault on exit from such an application?
 
 Invalid free() / delete / delete[] / realloc()
at 0x4C2A4BC: operator delete(void*) (in 
 /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
 by 0x8195C12: std::basic_string<char, std::char_traits<char>, 
 std::allocator<char> >::~basic_string() (in 
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16)
by 0x13890F3: coll_t::~coll_t() (osd_types.h:468)
by 0x8944DEC: __cxa_finalize (cxa_finalize.c:56)
by 0x6E1CEC5: ??? (in /src/ceph/src/.libs/librados.so.2.0.0)
by 0x725F400: ??? (in /src/ceph/src/.libs/librados.so.2.0.0)
by 0x89449D0: __run_exit_handlers (exit.c:78)
by 0x8944A54: exit (exit.c:100)
by 0x137FF37: usage(boost::program_options::options_description) 
 (ceph_objectstore_tool.cc:1794)
by 0x1380572: main (ceph_objectstore_tool.cc:1849)
 
 This looks fairly strange to me -- why does ceph_objectstore_tool do
 anything with librados? I thought it was just hitting the OSD
 filesystem structure directly.
 Also note that the crash appears to be underneath the coll_t
 destructor, probably in destroying its string. That combined with the
 weird librados presence makes me think memory corruption is running
 over the stack somewhere.
 
 Ah, this was fixed in 5d79605319fcde330bccce5e1b07276a98be02de in the 
 wip-libcommon branch.  The problem is partly when we link libcommon 
 statically (ceph-objectstore-tool) and dynamically (librados) at the same 
 time.  The easy fix here is not linking librados at all.
 
 Not sure why we see this sometimes and not always.. maybe link order?  In 
 any case, wip-libcommon moves libcommon.la into a .so shared between 
 librados and the binary using it to avoid the problem.  Makes things 
 slightly more restrictive with mixed versions, but i suspect it is worth 
 avoiding this sort of pain.
 
 Can you cherry-pick that commit and see if it resolves this for you?  
 And/or merge in that entire branch?
 
 sage



Re: [RFC] add rocksdb support

2014-06-13 Thread David Zafman

Don’t forget when a new submodule is added you need to initialize it.  From 
the README:

Building Ceph
=

To prepare the source tree after it has been git cloned,

$ git submodule update --init

To build the server daemons, and FUSE client, execute the following:

$ ./autogen.sh
$ ./configure
$ make
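
For the wip-rocksdb branch specifically, the sequence would be roughly the
following; the configure option is the one proposed further down in this
thread, so treat it as tentative:

git submodule update --init src/rocksdb
./autogen.sh
./configure --with-librocksdb-static
make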


David Zafman
Senior Developer
http://www.inktank.com
http://www.redhat.com

On Jun 13, 2014, at 11:51 AM, Sushma Gurram sushma.gur...@sandisk.com wrote:

 Hi Xinxin,
 
 I tried to compile the wip-rocksdb branch, but the src/rocksdb directory 
 seems to be empty. Do I need to put autoconf/automake in this directory?
 It doesn't seem to have any other source files and compilation fails:
 os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory
 compilation terminated.
 
 Thanks,
 Sushma
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Shu, Xinxin
 Sent: Monday, June 09, 2014 10:00 PM
 To: Mark Nelson; Sage Weil
 Cc: ceph-devel@vger.kernel.org; Zhang, Jian
 Subject: RE: [RFC] add rocksdb support
 
 Hi mark
 
 I have finished development of support of rocksdb submodule,  a pull request 
 for support of autoconf/automake for rocksdb has been created , you can find 
 https://github.com/ceph/rocksdb/pull/2 , if this patch is ok ,  I will create 
 a pull request for rocksdb submodule support , currently this patch can be 
 found https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
 
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
 Sent: Tuesday, June 10, 2014 1:12 AM
 To: Shu, Xinxin; Sage Weil
 Cc: ceph-devel@vger.kernel.org; Zhang, Jian
 Subject: Re: [RFC] add rocksdb support
 
 Hi Xinxin,
 
 On 05/28/2014 05:05 AM, Shu, Xinxin wrote:
 Hi sage ,
 I will add two configure options: --with-librocksdb-static and 
 --with-librocksdb.  With the --with-librocksdb-static option, ceph will compile 
 the code that it gets from the ceph repository; with the --with-librocksdb 
 option, in case of distro packages for rocksdb, ceph will not compile the 
 rocksdb code and will use the pre-installed library. Is that ok for you?
 
 Since current rocksdb does not support autoconf/automake, I will add 
 autoconf/automake support for rocksdb, but before that I think we should 
 fork a stable branch (maybe 3.0) for ceph.
 
 I'm looking at testing out the rocksdb support as well, both for the OSD and 
 for the monitor based on some issues we've been seeing lately.  Any news on 
 the 3.0 fork and autoconf/automake support in rocksdb?
 
 Thanks,
 Mark
 
 
 -Original Message-
 From: Mark Nelson [mailto:mark.nel...@inktank.com]
 Sent: Wednesday, May 21, 2014 9:06 PM
 To: Shu, Xinxin; Sage Weil
 Cc: ceph-devel@vger.kernel.org; Zhang, Jian
 Subject: Re: [RFC] add rocksdb support
 
 On 05/21/2014 07:54 AM, Shu, Xinxin wrote:
 Hi, sage
 
 I will add rocksdb submodule into the makefile , currently we want to have 
 fully performance tests on key-value db backend , both leveldb and rocksdb. 
 Then optimize on rocksdb performance.
 
 I'm definitely interested in any performance tests you do here.  Last winter 
 I started doing some fairly high level tests on raw 
 leveldb/hyperleveldb/raikleveldb.  I'm very interested in what you see with 
 rocksdb as a backend.
 
 
 -Original Message-
 From: Sage Weil [mailto:s...@inktank.com]
 Sent: Wednesday, May 21, 2014 9:19 AM
 To: Shu, Xinxin
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: [RFC] add rocksdb support
 
 Hi Xinxin,
 
 I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that 
 includes the latest set of patches with the groundwork and your rocksdb 
 patch.  There is also a commit that adds rocksdb as a git submodule.  I'm 
 thinking that, since there aren't any distro packages for rocksdb at this 
 point, this is going to be the easiest way to make this usable for people.
 
 If you can wire the submodule into the makefile, we can merge this in so 
 that rocksdb support is in the ceph.com packages on ceph.com.  I suspect 
 that the distros will prefer to turn this off in favor of separate shared 
 libs, but they can do this at their option if/when they include rocksdb in 
 the distro. I think the key is just to have both --with-librocksdb and 
 --with-librocksdb-static (or similar) options so that you can either use 
 the static or dynamically linked one.
 
 Has your group done further testing with rocksdb?  Anything interesting to 
 share?
 
 Thanks!
 sage
 
 
 
 

mon_command

2014-03-19 Thread David Zafman

My understanding is that we are going to be using rados_mon_command() to create 
pools, per Tracker #7586 (“deprecate rados_pool_create”).  What I found 
when building a test case for EC is that after using the mon_command to create 
the pool I need to use wait_for_latest_osdmap() in order to wait for the change 
to propagate.  The replicated pool test case using pool_create() doesn’t need 
to wait_for_latest_osdmap().
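
For context, this is the CLI form of what the test does, plus (approximately)
the JSON handed to rados_mon_command() for the same operation; the key names
are from memory, so treat them as illustrative:

ceph osd pool create ecpool 12 12 erasure
# roughly: {"prefix": "osd pool create", "pool": "ecpool",
#           "pg_num": 12, "pgp_num": 12, "pool_type": "erasure"}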

Should we be deprecating librados calls in favor of the very generic 
mon_command() interface?  I would suggest that we add the appropriate librados 
features to manipulate erasure coded pools.

David Zafman
Senior Developer
http://www.inktank.com



Re: 6685 backfill head/snapdir issue brain dump

2014-02-20 Thread David Zafman

Another way to look at this is to enumerate the recovery cases:

primary starts with head and no snapdir:

A   Recovery sets last_backfill_started to head and sends head object where 
needed
  head (1.b case while backfills in flight - 1.a when done)
  snapdir (2)

B   Recovery sets last_backfill_started to snapdir and would send snapdir 
remove(s) and same as above case for head
   head (1.b case while backfills in flight - 1.a when done)
   snapdir (1.a)

primary starts with snapdir and no head:

C   Recovery set last_backfill_started to head and sends remove of head
   head 1.a
   snapdir (2)

D   Recovery set last_backfill_started to snapdir and sends both remove of 
head and create of snapdir
   head 1.a
   snapdir (1.b case while backfills in flight - 1.a when done)


Cases B and D meet our criteria because they include head/snapdir <= 
last_backfill_started and we check head and snapdir for is_degraded_object().  
Also, removes are always processed before creates even if recover_backfill() 
saw them in the other order (case B).  That way, once the head objects are 
created (1.a) we know that all snapdirs have been removed too.  In other words, 
these 2 cases do not allow intervening operations to occur that confuse the 
head <-> snapdir state.

Case C is tricky.  An intervening write to head requires update_range() to 
determine that snapdir is gone, even though, had it not looked at the log, it 
would have tried to recover (re-create) snapdir.

Case A is the only one which has a problem with an intervening deletion of the 
head object.


David



On Feb 20, 2014, at 12:07 PM, Samuel Just sam.j...@inktank.com wrote:

 The current implementation divides the hobject space into two sets:
 1) oid | oid <= last_backfill_started
 2) oid | oid > last_backfill_started
 
 Space 1) is further divided into two sets:
 1.a) oid | oid \notin backfills_in_flight
 1.b) oid | oid \in backfills_in_flight
 
 The value of this division is that we must send ops in set 1.a to the
 backfill peer because we won't re-backfill those objects and they must
 therefore be kept up to date.  Furthermore, we *can* send the op
 because the backfill peer already has all of the dependencies (this
 statement is where we run into trouble).
 
 In set 2), we have not yet backfilled the object, so we are free to
 not send the op to the peer confident that the object will be
 backfilled later.
 
 In set 1.b), we block operations until the backfill operation is
 complete.  This is necessary at the very least because we are in the
 process of reading the object and shouldn't be sending writes anyway.
 Thus, it seems to me like we are blocking, in some sense, the minimum
 possible set of ops, which is good.
 
 The issue is that there is a small category of ops which violate our
 statement above that we can send ops in set 1.a: ops where the
 corresponding snapdir object is in set  2 or set 1.b.  The 1.b case we
 currently handle by requiring that snapdir also be
 !is_degraded_object.
 
 The case where the snapdir falls into set 2 should be the problem, but
 now I am wondering.  I think the original problem was as follows:
 1) advance last_backfill_started to head
 2) complete recovery on head
 3) accept op on head which deletes head and creates snapdir
 4) start op
 5) attempt to recover snapdir
 6) race with write and get screwed up
 
 Now, however, we have logic to delay backfill on ObjectContexts which
 currently have write locks.  It should suffice to take a write lock on
 the new snapdir and use that...which we do since the ECBackend patch
 series.  The case where we create head and remove snapdir isn't an
 issue since we'll just send the delete which will work whether snapdir
 exists or not...  We can also just include a delete in the snapdir
 creation transaction to make it correctly handle garbage snapdirs on
 backfill peers.  The snapdir would then be superfluously recovered,
 but that's probably ok?
 
 The main issue I see is that it would cause the primary's idea of the
 replica's backfill_interval to be slightly incorrect (snapdir would
 have been removed or created on the peer, but not reflected in the
 master's current backfill_interval which might contain snapdir).  We
 could adjust it in make_writeable, or update_range?
 
 Sidenote: multiple backfill peers complicates the issue only slightly.
 All backfill peers with last_backfill >= last_backfill_started are
 handled uniformly as above.  Any backfill_peer with last_backfill <
 last_backfill_started we can model as having a private
 last_backfill_started equal to last_backfill.  This results in a
 picture for that peer identical to the one above with an empty set
 1.b.  Because 1.b is empty for these peers, is_degraded_object can
 disregard them.  should_send_op accounts for them with the
 MAX(last_backfill, last_backfill_started) adjustment.
 
 Anyone have anything 

wip-libcephfs-emp-rb

2013-10-07 Thread David Zafman

I rebased wip-libcephfs and pushed it as wip-libcephfs-emp-rb so that we can get 
this into the Emperor release.  Sage mentioned that he had hit a fuse problem 
in the wip-libcephfs branch, so apparently the problem is still present.  Have 
you run into this bug in your testing?  Are you testing with these modifications 
to Ceph?

2013-10-04T10:39:01.664 
INFO:teuthology.task.workunit.client.0.out:[10.214.132.22]:   CC  
kernel/softirq.o
2013-10-04T10:39:02.059 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: *** Caught 
signal (Segmentation fault) **
2013-10-04T10:39:02.059 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  in thread 
7f57fa316780
2013-10-04T10:39:02.073 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  ceph version 
0.69-500-g09f4df0 (09f4df02a866230b19539b03061f4abc5ab47ae2)
2013-10-04T10:39:02.073 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  1: ceph-fuse() 
[0x5e0d1a]
2013-10-04T10:39:02.073 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  2: (()+0xfcb0) 
[0x7f57f9cc5cb0]
2013-10-04T10:39:02.073 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  3: 
(Client::_get_inodeno(Inode*)+0) [0x52dd10]
2013-10-04T10:39:02.073 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  4: 
(Client::ll_forget(Inode*, int)+0x4a) [0x538d1a]
2013-10-04T10:39:02.074 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  5: ceph-fuse() 
[0x52a1c5]
2013-10-04T10:39:02.074 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  6: 
(fuse_session_loop()+0x75) [0x7f57f9ee3d65]
2013-10-04T10:39:02.074 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  7: 
(main()+0x84c) [0x5266fc]
2013-10-04T10:39:02.074 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  8: 
(__libc_start_main()+0xed) [0x7f57f83f576d]
2013-10-04T10:39:02.074 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  9: ceph-fuse() 
[0x527c99]
2013-10-04T10:39:02.074 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 2013-10-04 
10:39:02.072487 7f57fa316780 -1 *** Caught signal (Segmentation fault) **
2013-10-04T10:39:02.074 
INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]:  in thread 
7f57fa316780

David Zafman
Senior Developer
http://www.inktank.com






Re: xattr limits

2013-10-04 Thread David Zafman

Here is the test script:



xattr-test.sh
Description: Binary data

David Zafman
Senior Developer
http://www.inktank.com




On Oct 3, 2013, at 11:02 PM, Loic Dachary l...@dachary.org wrote:

 Hi David,
 
 Would you mind attaching the script to the mail for completness ? It's a 
 useful thing to have :-)
 
 Cheers
 
 On 04/10/2013 01:21, David Zafman wrote:
 
 I want to record with the ceph-devel archive results from testing limits of 
 xattrs for Linux filesystems used with Ceph.
 
 Script that creates xattrs with name user.test1, user.test2, …. on a single 
 file
 3.10 linux kernel
 
 ext4
 value bytes    number of entries
     1          148
    16          103
   256           14
   512            7
  1024            3
  4036            1
 Beyond this immediately get ENOSPC
 
 btrfs
 value bytes    number of entries
     8          10k
    16          10k
    32          10k
    64          10k
   128          10k
   256          10k
   512          10k  (slow but worked; at 1,000,000 it got completely hung for
                      minutes at a time during removal; strace showed no forward progress)
  1024          10k
  2048          10k
  3096          10k
 Beyond this you start getting ENOSPC after fewer entries
 
 xfs (limit entries due to xfs crash with 10k entries)
 value bytes    number of entries
     1          1k
     8          1k
    16          1k
    32          1k
    64          1k
   128          1k
   256          1k
   512          1k
  1024          1k
  2048          1k
  4096          1k
  8192          1k
 16384          1k
 32768          1k
 65536          1k
 
 
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 All that is necessary for the triumph of evil is that good people do nothing.
 



RESEND: xattr issue with 3.11 kernel

2013-10-04 Thread David Zafman
`
setfattr --remove=$entry $FILENAME
if [ $? != 0 ];
then
  echo failure to remove $entry
  break
fi
rmcount=`expr $rmcount + 1`
  done
  getfattr --dump $FILENAME
  rmdir $FILENAME
done
rm src.$$

exit 0



David Zafman
Senior Developer
http://www.inktank.com






Re: [ceph-users] v0.67.4 released

2013-10-04 Thread David Zafman

Unit tests on v0.67.4 are not passing.  It could be that a test case needs to be fixed.

$ test/encoding/check-generated.sh
checking ceph-dencoder generated test instances...
numgen type
3 ACLGrant
…….
4 ObjectStore::Transaction
mon/PGMap.cc: In function 'void PGMap::apply_incremental(CephContext*, const 
PGMap::Incremental&)' thread 7fac10e81780 time 2013-10-04 18:08:59.019448
mon/PGMap.cc: 226: FAILED assert(inc.get_osd_epochs().find(osd) != 
inc.get_osd_epochs().end())
 ceph version 0.69-548-ge927941 (e927941fcadff56483137cffc0899b4ab9c6c297)
 1: (PGMap::apply_incremental(CephContext*, PGMap::Incremental const&)+0x697) 
[0x948bc7]
 2: (PGMap::generate_test_instances(std::list<PGMap*, std::allocator<PGMap*> 
>&)+0xc3) [0x949433]
 3: (main()+0xce27) [0x5e48f7]
 4: (__libc_start_main()+0xed) [0x7fac0ef3176d]
 5: ./ceph-dencoder() [0x5eb749]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted

David Zafman
Senior Developer
http://www.inktank.com




On Oct 4, 2013, at 4:55 PM, Sage Weil s...@inktank.com wrote:

 This point release fixes an important performance issue with radosgw, 
 keystone authentication token caching, and CORS.  All users (especially 
 those of rgw) are encouraged to upgrade.
 
 Notable changes:
 
 * crush: fix invalidation of cached names
 * crushtool: do not crash on non-unique bucket ids
 * mds: be more careful when decoding LogEvents
 * mds: fix heap check debugging commands
 * mon: avoid rebuilding old full osdmaps
 * mon: fix 'ceph crush move ...'
 * mon: fix 'ceph osd crush reweight ...'
 * mon: fix writeout of full osdmaps during trim
 * mon: limit size of transactions
 * mon: prevent both unmanaged and pool snaps
 * osd: disable xattr size limit (prevents upload of large rgw objects)
 * osd: fix recovery op throttling
 * osd: fix throttling of log messages for very slow requests
 * rgw: drain pending requests before completing write
 * rgw: fix CORS
 * rgw: fix inefficient list::size() usage
 * rgw: fix keystone token expiration
 * rgw: fix minor memory leaks
 * rgw: fix null termination of buffer
 
 For more detail:
 
 * http://ceph.com/docs/master/release-notes/#v0-67-4-dumpling
 * http://ceph.com/docs/master/_downloads/v0.67.4.txt
 
 You can get v0.67.4 from the usual locations:
 
 * Git at git://github.com/ceph/ceph.git
 * Tarball at http://ceph.com/download/ceph-0.67.4.tar.gz
 * For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
 * For RPMs, see http://ceph.com/docs/master/install/rpm
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



xattr limits

2013-10-03 Thread David Zafman

I want to record with the ceph-devel archive results from testing limits of 
xattrs for Linux filesystems used with Ceph.

Script that creates xattrs with name user.test1, user.test2, …. on a single file
3.10 linux kernel

ext4
value bytes    number of entries
    1          148
   16          103
  256           14
  512            7
 1024            3
 4036            1
Beyond this immediately get ENOSPC

btrfs
value bytes    number of entries
    8          10k
   16          10k
   32          10k
   64          10k
  128          10k
  256          10k
  512          10k  (slow but worked; at 1,000,000 it got completely hung for
                     minutes at a time during removal; strace showed no forward progress)
 1024          10k
 2048          10k
 3096          10k
Beyond this you start getting ENOSPC after fewer entries

xfs (limit entries due to xfs crash with 10k entries)
value bytes    number of entries
    1          1k
    8          1k
   16          1k
   32          1k
   64          1k
  128          1k
  256          1k
  512          1k
 1024          1k
 2048          1k
 4096          1k
 8192          1k
16384          1k
32768          1k
65536          1k
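
The creation side of the script boils down to a loop like this sketch; the real
xattr-test.sh also varied the value size and timed the removals:

FILENAME=testfile
VALUE=$(head -c 256 /dev/zero | tr '\0' 'x')     # one 256-byte value
touch "$FILENAME"
count=0
for i in $(seq 1 10000); do
    setfattr -n user.test$i -v "$VALUE" "$FILENAME" || break
    count=$((count + 1))
done
echo "stored $count xattrs of ${#VALUE} bytes each"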



Re: 4 failed, 298 passed in dzafman-2013-09-23_17:50:06-rados-wip-5862-testing-basic-plana

2013-09-24 Thread David Zafman

The osd.4 crash in 13443 is bug #5951:

2013-09-23 21:23:28.378428 1034e700  0 filestore(/var/lib/ceph/osd/ceph-4)  
error (17) File exists not handled on operation 20 (6579.0.0, or op 0, counting 
from 0)
2013-09-23 21:23:28.862204 1034e700  0 filestore(/var/lib/ceph/osd/ceph-4) 
unexpected error code
2013-09-23 21:23:28.864816 1034e700  0 filestore(/var/lib/ceph/osd/ceph-4)  
transaction dump:
{ ops: [
{ op_num: 0,
  op_name: mkcoll,
  collection: 4.6_head},
{ op_num: 1,
  op_name: collection_setattr,
  collection: 4.6_head,
  name: info,
  length: 1},
{ op_num: 2,
  op_name: omap_setkeys,
  collection: meta,
  oid: 16ef7597\/infos\/head\/\/-1,
  attr_lens: { 4.6_biginfo: 125,
  4.6_epoch: 4,
  4.6_info: 576}},
{ op_num: 3,
  op_name: touch,
  collection: meta,
  oid: 1039d44e\/pglog_4.6\/0\/\/-1},
{ op_num: 4,
  op_name: omap_rmkeys,
  collection: meta,
  oid: 1039d44e\/pglog_4.6\/0\/\/-1},
{ op_num: 5,
  op_name: omap_setkeys,
  collection: meta,
  oid: 1039d44e\/pglog_4.6\/0\/\/-1,
  attr_lens: {}}]}

2013-09-23 21:23:28.959220 1a282700  5 osd.4 pg_epoch: 424 pg[32.0( empty 
local-les=424 n=0 ec=96 les/c 400/400 419/419/419) [4,1] r=0 lpr=419 
pi=364-418/6 mlcod 0'0 active] enter
 Started/Primary/Active/Activating
2013-09-23 21:23:29.116768 1034e700 -1 os/FileStore.cc: In function 'unsigned 
int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)' 
thread 1034e700 time 2013-09-23 21:23:28.920055
os/FileStore.cc: 2461: FAILED assert(0 == "unexpected error")

 ceph version 0.69-220-g4f7526a (4f7526a785692795ee29f7101b8b18482b4c6e11)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, 
int)+0xffc) [0x72473c]
 2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, 
std::allocator<ObjectStore::Transaction*> >&, unsigned long, 
ThreadPool::TPHandle*)+0x71) [0x72b241]
 3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x291) 
[0x72b4f1]
 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x93da36]
 5: (ThreadPool::WorkThread::entry()+0x10) [0x93f840]
 6: (()+0x7e9a) [0x503be9a]
 7: (clone()+0x6d) [0x6c71ccd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.

David Zafman
Senior Developer
http://www.inktank.com

On Sep 24, 2013, at 12:03 PM, Sage Weil s...@inktank.com wrote:

 On Tue, 24 Sep 2013, David Zafman wrote:
 
 Rados suite test run results for wip-5862.  2 scrub mismatch from mon
 (known problem).  2 are valgrind issues found with mds and osd. 
 
 What is the osd valgrind failure?  And the osd.4 crash on 13443?
 
 (Note that the teuthology.log will include message about valgrind issues 
 found in the mds log, but does not generate an actual error about it.)
 
 Thanks!
 sage
 
 
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 Begin forwarded message:
 
  From: teuthwor...@teuthology.front.sepia.ceph.com
 Subject: 4 failed, 298 passed in
 dzafman-2013-09-23_17:50:06-rados-wip-5862-testing-basic-plana
 Date: September 23, 2013 10:48:00 PM PDT
 To: david.zaf...@inktank.com
 
 Test Run:
 dzafman-2013-09-23_17:50:06-rados-wip-5862-testing-basic-plana
 =
 logs:  
 http://qa-proxy.ceph.com/teuthology/dzafman-2013-09-23_17:50:06-rados-wip
 -5862-testing-basic-plana/
 failed: 4
 hung:   0
 passed: 298
 
 Failed
 =
 [13187]  rados/monthrash/{ceph/ceph.yaml clusters/3-mons.yaml
 fs/xfs.yaml msgr-failures/mon-delay.yaml
 thrashers/force-sync-many.yaml workloads/snaps-few-objects.yaml}
 -
 time:   2095s
 log:   
 http://qa-proxy.ceph.com/teuthology/dzafman-2013-09-23_17:50:06-rados-wi
 p-5862-testing-basic-plana/13187/
 
2013-09-23 18:22:37.805204 mon.0 10.214.132.24:6789/0 514 : [ERR]
 scrub
mismatch in cluster log
 
 [13449]  rados/verify/{1thrash/none.yaml clusters/fixed-2.yaml
 fs/btrfs.yaml msgr-failures/few.yaml tasks/rados_api_tests.yaml
 validater/valgrind.yaml}
 -
 time:   1067s
 log:   
 http://qa-proxy.ceph.com/teuthology/dzafman-2013-09-23_17:50:06-rados-wi
 p-5862-testing-basic-plana/13449/
 
saw valgrind issues
 
 [13443]  rados/verify/{1thrash/default.yaml clusters/fixed-2.yaml
 fs/btrfs.yaml msgr-failures/few.yaml tasks/rados_api_tests.yaml
 validater/valgrind.yaml}
 -
 time:   1307s
 log:   
 http://qa-proxy.ceph.com/teuthology/dzafman-2013-09-23_17:50:06-rados-wi
 p-5862-testing-basic-plana/13443/
 
timed out waiting for admin_socket to appear after osd.4 restart
 
 [13227]  rados/monthrash/{ceph/ceph.yaml

Re: [ceph-users] cuttlefish countdown -- OSD doesn't get marked out

2013-04-26 Thread David Zafman

The behavior you are seeing is exactly what would be expected if OSDs are not 
being marked out.  The testing of my fix showed that if a portion of a rack's 
OSDs go down they will be marked out after the configured amount of time (5 min 
by default).  Once the down OSDs are out, the remaining OSDs take responsibility for 
holding the data assigned to that rack.

Though I didn't look at the data movement, I'm confident that it will work.  
You can simply mark your OSDs out manually to verify that missing replicas are 
replaced.
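
Concretely, that check is just the following (osd.1 being the one shown down in
the osd tree quoted below):

ceph osd out 1      # mark the down osd out by hand
ceph -w             # watch the PGs recover/backfill to active+clean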

David Zafman
Senior Developer
http://www.inktank.com




On Apr 26, 2013, at 1:50 AM, Martin Mailand mar...@tuxadero.com wrote:

 Hi David,
 
 did you test it with more than one rack as well? In my first problem I
 used two racks, with a custom crushmap, so that the replicas are in the
 two racks (replicationlevel = 2). Than I took one osd down, and expected
 that the remaining osds in this rack would get the now missing replicas
 from the osd of the other rack.
 But nothing happened, the cluster stayed degraded.
 
 -martin
 
 
 On 26.04.2013 02:22, David Zafman wrote:
 
 I filed tracker bug 4822 and have wip-4822 with a fix.  My manual testing 
 shows that it works.  I'm building a teuthology test.
 
 Given your osd tree has a single rack it should always mark OSDs down after 
 5 minutes by default.
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 
 
 
 On Apr 25, 2013, at 9:38 AM, Martin Mailand mar...@tuxadero.com wrote:
 
 Hi Sage,
 
 On 25.04.2013 18:17, Sage Weil wrote:
 What is the output from 'ceph osd tree' and the contents of your 
 [mon*] sections of ceph.conf?
 
 Thanks!
 sage
 
 
 root@store1:~# ceph osd tree
 
 # idweight  type name   up/down reweight
 -1  24  root default
 -3  24  rack unknownrack
 -2  4   host store1
 0   1   osd.0   up  1   
 1   1   osd.1   down1   
 2   1   osd.2   up  1   
 3   1   osd.3   up  1   
 -4  4   host store3
 10  1   osd.10  up  1   
 11  1   osd.11  up  1   
 8   1   osd.8   up  1   
 9   1   osd.9   up  1   
 -5  4   host store4
 12  1   osd.12  up  1   
 13  1   osd.13  up  1   
 14  1   osd.14  up  1   
 15  1   osd.15  up  1   
 -6  4   host store5
 16  1   osd.16  up  1   
 17  1   osd.17  up  1   
 18  1   osd.18  up  1   
 19  1   osd.19  up  1   
 -7  4   host store6
 20  1   osd.20  up  1   
 21  1   osd.21  up  1   
 22  1   osd.22  up  1   
 23  1   osd.23  up  1   
 -8  4   host store2
 4   1   osd.4   up  1   
 5   1   osd.5   up  1   
 6   1   osd.6   up  1   
 7   1   osd.7   up  1   
 
 
 
 [global]
   auth cluster requierd = none
   auth service required = none
   auth client required = none
 #   log file = 
   log_max_recent=100
   log_max_new=100
 
 [mon]
   mon data = /data/mon.$id
 [mon.a]
   mon host = store1
   mon addr = 192.168.195.31:6789
 [mon.b]
   mon host = store3
   mon addr = 192.168.195.33:6789
 [mon.c]
   mon host = store5
   mon addr = 192.168.195.35:6789
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 



Re: [ceph-users] cuttlefish countdown -- OSD doesn't get marked out

2013-04-26 Thread David Zafman

Mike / Martin,

The OSD down behavior Mike is seeing is different.  You should be seeing 
messages like this in your leader's monitor log:

can_mark_down current up_ratio 0.17 < min 0.3, will not mark osd.2 down

To dampen certain kinds of cascading failures, we are deliberately restricting 
automatically marking > 30% of OSDs down.
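
If you need to get past that guard while intentionally stopping most of a
cluster, you can either mark the OSDs down/out by hand or relax the ratio; the
option name below is from memory, so double-check it:

# ceph.conf on the monitors, [mon] section (default is 0.3):
#   mon osd min up ratio = 0.05
# or mark the stopped OSDs explicitly:
ceph osd down 0
ceph osd out 0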

As far as Martin is concerned, his osd tree shows a single rack, but he said that 
his crush rules are supposed to put a replica on each of 2 racks.  I don't 
remember seeing his crush rules in any of the e-mails, but even so he only has 
unknownrack with id -3 defined.

David Zafman
Senior Developer
http://www.inktank.com




On Apr 26, 2013, at 6:44 AM, Mike Dawson mike.daw...@scholarstack.com wrote:

 David / Martin,
 
 I can confirm this issue. At present I am running monitors only with 100% of 
 my OSD processes shutdown down. For the past couple hours, Ceph has reported:
 
 osdmap e1323: 66 osds: 19 up, 66 in
 
 I can mark them down manually using
 
 ceph osd down 0
 
 as expected, but they never get marked down automatically. Like Martin, I 
 also have a custom crushmap, but this cluster is operating with a single 
 rack. I'll be happy to provide any documentation / configs / logs you would 
 like.
 
 I am currently running ceph version 0.60-666-ga5cade1 
 (a5cade1fe7338602fb2bbfa867433d825f337c87) from gitbuilder.
 
 - Mike
 
 On 4/26/2013 4:50 AM, Martin Mailand wrote:
 Hi David,
 
 did you test it with more than one rack as well? In my first problem I
 used two racks, with a custom crushmap, so that the replicas are in the
 two racks (replicationlevel = 2). Than I took one osd down, and expected
 that the remaining osds in this rack would get the now missing replicas
 from the osd of the other rack.
 But nothing happened, the cluster stayed degraded.
 
 -martin
 
 
 On 26.04.2013 02:22, David Zafman wrote:
 
 I filed tracker bug 4822 and have wip-4822 with a fix.  My manual testing 
 shows that it works.  I'm building a teuthology test.
 
 Given your osd tree has a single rack it should always mark OSDs down after 
 5 minutes by default.
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 
 
 
 On Apr 25, 2013, at 9:38 AM, Martin Mailand mar...@tuxadero.com wrote:
 
 Hi Sage,
 
 On 25.04.2013 18:17, Sage Weil wrote:
 What is the output from 'ceph osd tree' and the contents of your
 [mon*] sections of ceph.conf?
 
 Thanks!
 sage
 
 
 root@store1:~# ceph osd tree
 
 # id   weight  type name   up/down reweight
 -1 24  root default
 -3 24  rack unknownrack
 -2 4   host store1
 0  1   osd.0   up  1   
 1  1   osd.1   down1   
 2  1   osd.2   up  1   
 3  1   osd.3   up  1   
 -4 4   host store3
 10 1   osd.10  up  1   
 11 1   osd.11  up  1   
 8  1   osd.8   up  1   
 9  1   osd.9   up  1   
 -5 4   host store4
 12 1   osd.12  up  1   
 13 1   osd.13  up  1   
 14 1   osd.14  up  1   
 15 1   osd.15  up  1   
 -6 4   host store5
 16 1   osd.16  up  1   
 17 1   osd.17  up  1   
 18 1   osd.18  up  1   
 19 1   osd.19  up  1   
 -7 4   host store6
 20 1   osd.20  up  1   
 21 1   osd.21  up  1   
 22 1   osd.22  up  1   
 23 1   osd.23  up  1   
 -8 4   host store2
 4  1   osd.4   up  1   
 5  1   osd.5   up  1   
 6  1   osd.6   up  1   
 7  1   osd.7   up  1   
 
 
 
 [global]
auth cluster requierd = none
auth service required = none
auth client required = none
 #   log file = 
log_max_recent=100
log_max_new=100
 
 [mon]
mon data = /data/mon.$id
 [mon.a]
mon host = store1
mon addr = 192.168.195.31:6789
 [mon.b]
mon host = store3
mon addr = 192.168.195.33:6789
 [mon.c]
mon host = store5
mon addr = 192.168.195.35:6789
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 

Re: [ceph-users] cuttlefish countdown -- OSD doesn't get marked out

2013-04-25 Thread David Zafman

I filed tracker bug 4822 and have wip-4822 with a fix.  My manual testing shows 
that it works.  I'm building a teuthology test.

Given your osd tree has a single rack it should always mark OSDs down after 5 
minutes by default.

David Zafman
Senior Developer
http://www.inktank.com




On Apr 25, 2013, at 9:38 AM, Martin Mailand mar...@tuxadero.com wrote:

 Hi Sage,
 
 On 25.04.2013 18:17, Sage Weil wrote:
 What is the output from 'ceph osd tree' and the contents of your 
 [mon*] sections of ceph.conf?
 
 Thanks!
 sage
 
 
 root@store1:~# ceph osd tree
 
 # id  weight  type name   up/down reweight
 -124  root default
 -324  rack unknownrack
 -24   host store1
 0 1   osd.0   up  1   
 1 1   osd.1   down1   
 2 1   osd.2   up  1   
 3 1   osd.3   up  1   
 -44   host store3
 101   osd.10  up  1   
 111   osd.11  up  1   
 8 1   osd.8   up  1   
 9 1   osd.9   up  1   
 -54   host store4
 121   osd.12  up  1   
 131   osd.13  up  1   
 141   osd.14  up  1   
 151   osd.15  up  1   
 -64   host store5
 161   osd.16  up  1   
 171   osd.17  up  1   
 181   osd.18  up  1   
 191   osd.19  up  1   
 -74   host store6
 201   osd.20  up  1   
 211   osd.21  up  1   
 221   osd.22  up  1   
 231   osd.23  up  1   
 -84   host store2
 4 1   osd.4   up  1   
 5 1   osd.5   up  1   
 6 1   osd.6   up  1   
 7 1   osd.7   up  1   
 
 
 
 [global]
auth cluster requierd = none
auth service required = none
auth client required = none
 #   log file = 
log_max_recent=100
log_max_new=100
 
 [mon]
mon data = /data/mon.$id
 [mon.a]
mon host = store1
mon addr = 192.168.195.31:6789
 [mon.b]
mon host = store3
mon addr = 192.168.195.33:6789
 [mon.c]
mon host = store5
mon addr = 192.168.195.35:6789
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



.gitignore issues

2013-02-11 Thread David Zafman

After updating to latest master I have the following files listed by git status:

$ git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#   src/bench_log
#   src/ceph-filestore-dump
#   src/ceph.conf
#   src/dupstore
#   src/keyring
#   src/kvstorebench
#   src/multi_stress_watch
#   src/omapbench
#   src/psim
#   src/radosacl
#   src/scratchtool
#   src/scratchtoolpp
#   src/smalliobench
#   src/smalliobenchdumb
#   src/smalliobenchfs
#   src/smalliobenchrbd
#   src/streamtest
#   src/testcrypto
#   src/testkeys
#   src/testrados
#   src/testrados_delete_pools_parallel
#   src/testrados_list_parallel
#   src/testrados_open_pools_parallel
#   src/testrados_watch_notify
#   src/testsignal_handlers
#   src/testtimers
#   src/tpbench
#   src/xattr_bench
nothing added to commit but untracked files present (use "git add" to track)
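
A quick local workaround until .gitignore is updated is to append the new names
(taken from the list above); whether entries need a leading slash here is a
guess, so adjust to match the existing file:

for f in bench_log ceph-filestore-dump dupstore kvstorebench psim radosacl; do
    echo "$f" >> src/.gitignore
done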

David Zafman
Senior Developer
david.zaf...@inktank.com





Re: [PATCH 0/2] two small patches for CEPH wireshark plugin

2013-01-28 Thread David Zafman

You could look at the wip-wireshark-zafman branch.  I rebased it and force 
pushed it.   It has changes to the wireshark.patch and a minor change I needed 
to get it to build.  I'm surprised the recent checkin didn't include the change 
to packet-ceph.c which I needed to get it to build.

David Zafman
Senior Developer
david.zaf...@inktank.com



On Jan 24, 2013, at 12:49 PM, Danny Al-Gaaf danny.al-g...@bisect.de wrote:

 Am 24.01.2013 19:31, schrieb Sage Weil:
 Hi Danny!
 [...]
 Since you brought up wireshark...
 
 We would LOVE LOVE LOVE it if this plugin could get upstream into 
 wireshark.  
 
 Yes, this would be great.
 
 IIRC, the problem (last time we checked, ages ago) was that 
 there were strict coding guidelines for that project that weren't 
 followed.  I'm not sure if that is still the case, or even if that is 
 accurate.
 
 It would be great if someone on this list who is looking for a way to 
 contribute could take the lead on trying to make this happen... :-)
 
 I'll take a look at it maybe ... if I find some free time for it.
 
 What about the patches? Can we apply them to the ceph git tree until we
 have another solution for the wireshark code?
 
 Danny



master branch issue in ceph.git

2013-01-17 Thread David Zafman

The latest code is hanging trying to start teuthology.  I used teuthology-nuke 
to clear old state and reboot the machines.  I was using my branch rebased to 
latest master and when that started failing I switched to the default config.  
It still keeps hanging here:

INFO:teuthology.task.ceph:Waiting until ceph is healthy...

$ ceph -s
   health HEALTH_WARN 5 pgs degraded; 108 pgs stuck unclean
   monmap e1: 3 mons at 
{0=10.214.131.23:6789/0,1=10.214.131.21:6789/0,2=10.214.131.20:6789/0}, 
election epoch 6, quorum 0,1,2 0,1,2
   osdmap e7: 9 osds: 9 up, 9 in
pgmap v25: 108 pgs: 103 active+remapped, 5 active+degraded; 0 bytes data, 
798 GB used, 3050 GB / 4055 GB avail
   mdsmap e2: 0/0/0 up
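
The usual next step when it wedges like this is to ask the cluster which PGs
are stuck and where they map:

ceph health detail
ceph pg dump_stuck unclean
ceph osd tree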

David Zafman
Senior Developer
david.zaf...@inktank.com





Fwd: Interfaces proposed changes

2013-01-07 Thread David Zafman

I sent this proposal out to the developers that own the FSAL CEPH portion of 
Nfs-Ganesha.  They have changes to Ceph that expose additional interfaces for 
this.  This is our initial cut at improving the interfaces.

David Zafman
Senior Developer
david.zaf...@inktank.com


Begin forwarded message:

 From: David Zafman david.zaf...@inktank.com
 Subject: Interfaces proposed changes
 Date: January 4, 2013 5:50:49 PM PST
 To: Matthew W. Benjamin m...@linuxbox.com, Adam C. Emerson 
 aemer...@linuxbox.com
 
 
 Below is a patch that shows the newly proposed low-level interface.  
 Obviously, the ceph_ll_* functions you created in libcephfs.cc will have the 
 corresponding changes made to them.  An Fh * type is used as an open file 
 descriptor and needs a corresponding ll_release()/ceph_ll_close().  An Inode 
 * returned by various inode create functions and ll_lookup_ino() is a 
 referenced inode and needs a corresponding _ll_put() exposed via something 
 maybe named ceph_ll_put().
 
 The existing FSAL CEPH doesn't ever call ceph_ll_forget() even though there 
 are references taken on inodes at the ceph ll_* operation level.  This interface 
 creates a clearer model to be used by FSAL CEPH.  As I don't understand 
 Ganesha's inode caching model, it isn't clear to me if it can indirectly 
 hold inodes that are below FSAL.  Especially for NFS v3 where there is no 
 open state, the code shouldn't keep doing final release of an inode after 
 every operation.
 
 diff --git a/src/client/Client.cc b/src/client/Client.cc
 index d876454..4d4d0f1 100644
 --- a/src/client/Client.cc
 +++ b/src/client/Client.cc
 @@ -6250,13 +6250,39 @@ bool Client::ll_forget(vinodeno_t vino, int num)
   return last;
 }
 
 +
 +inodeno_t Client::ll_get_ino(Inode *in)
 +{
 +  return in->ino;
 +}
 +
 +snapid_t Client::ll_get_snapid(Inode *in)
 +{
 +  return in->snapid;
 +}
 +
 +vinodeno_t Client::ll_get_vino(Inode *in)
 +{
 +  return vinodeno_t(in->ino, in->snapid);
 +}
 +
 +Inode *Client::ll_lookup_ino(vinodeno_t vino)
 +{
 +  Mutex::Locker lock(client_lock);
 +  hash_map<vinodeno_t,Inode*>::iterator p = inode_map.find(vino);
 +  if (p == inode_map.end())
 +    return NULL;
 +  Inode *in = p->second;
 +  _ll_get(in);
 +  return in;
 +}
 +
 Inode *Client::_ll_get_inode(vinodeno_t vino)
 {
   assert(inode_map.count(vino));
   return inode_map[vino];
 }
 
 -
 int Client::ll_getattr(vinodeno_t vino, struct stat *attr, int uid, int gid)
 {
   Mutex::Locker lock(client_lock);
 @@ -7219,7 +7245,7 @@ int Client::ll_release(Fh *fh)
   return 0;
 }
 
 -
 +// --
 
 
 
 diff --git a/src/client/Client.h b/src/client/Client.h
 index 9512a2d..0cfe8d9 100644
 --- a/src/client/Client.h
 +++ b/src/client/Client.h
 @@ -706,6 +706,32 @@ public:
   void ll_register_ino_invalidate_cb(client_ino_callback_t cb, void *handle);
 
   void ll_register_getgroups_cb(client_getgroups_callback_t cb, void *handle);
 +
 +  // low-level interface v2
 +  inodeno_t ll_get_ino(Inode *in);
 +  snapid_t ll_get_snapid(Inode *in);
 +  vinodeno_t ll_get_vino(Inode *in);
 +  Inode *ll_lookup_ino(vinodeno_t vino);
 +  int ll_lookup(Inode *parent, const char *name, struct stat *attr, Inode **out, int uid = -1, int gid = -1);
 +  bool ll_forget(Inode *in, int count);
 +  int ll_getattr(Inode *in, struct stat *st, int uid = -1, int gid = -1);
 +  int ll_setattr(Inode *in, struct stat *st, int mask, int uid = -1, int gid = -1);
 +  int ll_getxattr(Inode *in, const char *name, void *value, size_t size, int uid=-1, int gid=-1);
 +  int ll_setxattr(Inode *in, const char *name, const void *value, size_t size, int flags, int uid=-1, int gid=-1);
 +  int ll_removexattr(Inode *in, const char *name, int uid=-1, int gid=-1);
 +  int ll_listxattr(Inode *in, char *list, size_t size, int uid=-1, int gid=-1);
 +  int ll_opendir(Inode *in, void **dirpp, int uid = -1, int gid = -1);
 +  int ll_readlink(Inode *in, const char **value, int uid = -1, int gid = -1);
 +  int ll_mknod(Inode *in, const char *name, mode_t mode, dev_t rdev, struct stat *attr, Inode **out, int uid = -1, int gid = -1);
 +  int ll_mkdir(Inode *in, const char *name, mode_t mode, struct stat *attr, Inode **out, int uid = -1, int gid = -1);
 +  int ll_symlink(Inode *in, const char *name, const char *value, struct stat *attr, Inode **out, int uid = -1, int gid = -1);
 +  int ll_unlink(Inode *in, const char *name, int uid = -1, int gid = -1);
 +  int ll_rmdir(Inode *in, const char *name, int uid = -1, int gid = -1);
 +  int ll_rename(Inode *parent, const char *name, Inode *newparent, const char *newname, int uid = -1, int gid = -1);
 +  int ll_link(Inode *in, Inode *newparent, const char *newname, struct stat *attr, int uid = -1, int gid = -1);
 +  int ll_open(Inode *in, int flags, Fh **fh, int uid = -1, int gid = -1);
 +  int ll_create(Inode *parent, const char *name, mode_t mode, int flags, struct stat *attr, Inode **out, int uid = -1, int gid = -1);
 +  int ll_statfs(Inode *in, struct statvfs *stbuf

Re: [PATCH REPOST 0/4] rbd: four minor patches

2013-01-03 Thread David Zafman

I reviewed these.

Reviewed-by: David Zafman david.zaf...@inktank.com

David Zafman
Senior Developer
david.zaf...@inktank.com



On Jan 3, 2013, at 11:04 AM, Alex Elder el...@inktank.com wrote:

 I'm re-posting my patch backlog, in chunks that may or may not
 match how they got posted before.  This series contains some
 pretty fairly straightforward changes.
 
   -Alex
 
 [PATCH REPOST 1/4] rbd: document rbd_spec structure
 [PATCH REPOST 2/4] rbd: kill rbd_spec-image_name_len
 [PATCH REPOST 3/4] rbd: kill rbd_spec-image_id_len
 [PATCH REPOST 4/4] rbd: use kmemdup()



testing branch of ceph-client repo was force pushed

2012-12-08 Thread David Zafman

I amended the last 5 commits which I committed to the testing branch last 
night.  Please update your repositories accordingly.

David


Re: 0.55 init script Issue?

2012-12-05 Thread David Zafman

Keep in mind that some of the init.d stuff doesn't work with a ceph-deploy
installed system.  It isn't clear to me whether we need to fix ceph-deploy or
whether, for that type of setup, only upstart should be used/available.

David

On Dec 5, 2012, at 11:41 AM, Dan Mick dan.m...@inktank.com wrote:

 The story as best I know it is that we're trying to transition to and use 
 upstart where possible, but that the upstart config does not (yet?) try to do 
 what the init.d config did.  That is, it doesn't support options to the one 
 script, but rather separates daemons into separate services, and does not 
 reach out to remote machines to start daemons, etc.
 
 The intent is that init.d/ceph is left for non-Upstart distros, AFAICT.
 
 Tv had some design notes here:
 
 http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg09314.html
 
 We need better documentation/rationale here at least.
 
 
 
 On 12/05/2012 08:15 AM, Mike Dawson wrote:
 All,
 
 After upgrading from 0.54 to 0.55, the command "service ceph start"
 fails, but "/etc/init.d/ceph start" works. This is the case for start,
 stop, etc. Here is an example:
 
 root@node2:~# /etc/init.d/ceph stop
 === mon.a ===
 Stopping Ceph mon.a on node2...kill 2505...done
 === osd.0 ===
 Stopping Ceph osd.0 on node2...kill 5042...done
 === osd.1 ===
 Stopping Ceph osd.1 on node2...kill 5116...done
 === osd.17 ===
 Stopping Ceph osd.17 on node2...kill 5275...done
 
 
 root@node2:~# service ceph start
 start: Job is already running: ceph
 
 
 root@node2:~# /etc/init.d/ceph start
 === mon.a ===
 Starting Ceph mon.a on node2...
 starting mon.a rank 0 at 172.16.1.2:6789/0 mon_data
 /var/lib/ceph/mon/ceph-a fsid 4951e786-945e-47b6-b1b1-4043b6cc3b55
 === osd.0 ===
 Starting Ceph osd.0 on node2...
 starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /dev/sda6
 === osd.1 ===
 Starting Ceph osd.1 on node2...
 starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /dev/sda7
 === osd.17 ===
 Starting Ceph osd.17 on node2...
 starting osd.17 at :/0 osd_data /var/lib/ceph/osd/ceph-17 /dev/sda8
 
 
 This is Ubuntu 12.10 with packages from debian-testing. One other user
 on IRC confirmed the same behavior. Is this a known issue?
 
 
 Thanks,
 Mike Dawson
 



Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread David Zafman

On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:

 On Tue, 27 Nov 2012, Sam Lang wrote:
 
 3. When a client acquires the cap for a file, have the mds provide its 
 current
 time as well.  As the client updates the mtime, it uses the timestamp 
 provided
 by the mds and the time since the cap was acquired.
 Except for the skew caused by the message latency, this approach allows the
 mtime to be based off the mds time, so it will be consistent across clients
 and the mds.  It does however, allow a client to set an mtime to the future
 (based off of its local time), which might be undesirable, but that is more
 like how  NFS behaves.  Message latency probably won't be much of an issue
 either, as the granularity of mtime is a second. Also, the client can set its
 cap acquired timestamp to the time at which the cap was requested, ensuring
 that the relative increment includes the round trip latency so that the mtime
 will always be set further ahead. Of course, this approach would be a lot 
 more
 intrusive to implement. :-)
 
 Yeah, I'm less excited about this one.
 
 I think that giving consistent behavior from a single client despite clock 
 skew is a good goal.  That will make things like pjd's test behave 
 consistently, for example.
 

My suggestion is that a client writing to a file will try to use its local
clock unless doing so would cause the mtime to go backward.  In that case it
will simply perform the minimum mtime advance possible (1 second?).  This
handles the case in which one client created a file using its clock (per the
previously suggested change) and then another client, whose clock is behind,
writes.
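
In rough pseudocode (just to illustrate the clamp; the helper name is made up):

#include <time.h>

/* Sketch only: use the writer's local clock unless that would move mtime
 * backward; if the local clock is behind, advance by the smallest step the
 * granularity allows (1 second for a time_t). */
time_t mtime_for_write(time_t current_mtime, time_t local_now)
{
  if (local_now >= current_mtime)
    return local_now;            /* normal case: local clock is usable */
  return current_mtime + 1;      /* clock is behind: minimum forward advance */
}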

David



Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread David Zafman

On Nov 27, 2012, at 11:05 AM, Sam Lang sam.l...@inktank.com wrote:

 On 11/27/2012 12:01 PM, Sage Weil wrote:
 On Tue, 27 Nov 2012, David Zafman wrote:
 
 On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:
 
 On Tue, 27 Nov 2012, Sam Lang wrote:
 
 3. When a client acquires the cap for a file, have the mds provide its 
 current
 time as well.  As the client updates the mtime, it uses the timestamp 
 provided
 by the mds and the time since the cap was acquired.
 Except for the skew caused by the message latency, this approach allows 
 the
 mtime to be based off the mds time, so it will be consistent across 
 clients
 and the mds.  It does however, allow a client to set an mtime to the 
 future
 (based off of its local time), which might be undesirable, but that is 
 more
 like how  NFS behaves.  Message latency probably won't be much of an issue
 either, as the granularity of mtime is a second. Also, the client can set 
 its
 cap acquired timestamp to the time at which the cap was requested, 
 ensuring
 that the relative increment includes the round trip latency so that the 
 mtime
 will always be set further ahead. Of course, this approach would be a lot 
 more
 intrusive to implement. :-)
 
 Yeah, I'm less excited about this one.
 
 I think that giving consistent behavior from a single client despite clock
 skew is a good goal.  That will make things like pjd's test behave
 consistently, for example.
 
 
 My suggestion is that a client writing to a file will try to use it's
 local clock unless it would cause the mtime to go backward.  In that
 case it will simply perform the minimum mtime advance possible (1
 second?).  This handles the case in which one client created a file
 using his clock (per previous suggested change), then another client
 writes with a clock that is behind.
 
 We can choose to not decrement at the client, but because mtime is a time_t 
 (seconds since epoch), we can't increment by 1 for each write. 1000 writes 
 each taking 0.01s would move the mtime 990 seconds into the future.

The mtime update shouldn't work that way (see below).

 
 
 That's a possibility (if it's 1ms or 1ns, at least :). We need to verify
 what POSIX says about that, though: if you utimes(2) an mtime into the
 future, what happens on write(2)?

On ext4, a write(2) after the mtime has been set into the future with utimes(2)
does make the time go backward.  However, we can notice that if ctime == mtime,
then the last thing done to the file was a create/write/truncate.  This means
that we should not let the mtime go backward in that case.  If ctime != mtime,
then the mtime was set by utimes(2), so we can set mtime using our clock even
if it goes backwards.
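
As a sketch of that rule (illustration only, not a patch):

#include <time.h>

/* ctime == mtime implies the last change was a create/write/truncate, so
 * never let mtime move backward; otherwise the mtime was last set via
 * utimes(2) and the local clock may be used even if it moves mtime back. */
time_t choose_mtime(time_t cur_ctime, time_t cur_mtime, time_t local_now)
{
  if (cur_ctime == cur_mtime && local_now < cur_mtime)
    return cur_mtime;       /* hold the line (or bump minimally, as above) */
  return local_now;         /* utimes(2) case, or the local clock is ahead */
}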

 
 According to http://pubs.opengroup.org/onlinepubs/009695399/, writes only 
 require an update to mtime, it doesn't specify what the update should be:
 
 Upon successful completion, where nbyte is greater than 0, write() shall 
 mark for update the st_ctime and st_mtime fields of the file, and if the file 
 is a regular file, the S_ISUID and S_ISGID bits of the file mode may be 
 cleared.

What this really means is that all writes mark the mtime for update but do not
yet set a specific time in the inode.  All writes/truncates will be rolled into
a single mtime bump.  So even if we only have 1-second granularity (though
hopefully it is 1 ms or 1 us), a new mtime only needs to be set when a stat
occurs (or, in our case, when sending info to the MDS or returning
capabilities), and it will be at most 1 second ahead.
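
In other words, something like this sketch (field and helper names invented):

#include <time.h>

/* Writes only mark the mtime dirty; a concrete value is chosen once, when the
 * time is actually observed (stat, flushing caps to the MDS). */
struct TimeState {
  time_t mtime;
  bool   mtime_dirty;
};

void note_write(struct TimeState *ts)        /* any number of writes/truncates */
{
  ts->mtime_dirty = true;
}

time_t observed_mtime(struct TimeState *ts, time_t local_now)
{
  if (ts->mtime_dirty) {
    ts->mtime = (local_now > ts->mtime) ? local_now : ts->mtime + 1;
    ts->mtime_dirty = false;                 /* single bump, at most 1s ahead */
  }
  return ts->mtime;
}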

 
 In NFS, the server sets the mtime.  It's relatively common to see "Warning:
 file 'foo' has modification time in the future" if you're compiling on NFS
 and your client and NFS server clocks are skewed.  So allowing the mtime to
 be set in the near future would at least follow the principle of least
 surprise for most folks.

So Ceph can hit this warning too, if mtimes are set by clients with skewed
clocks and therefore appear to be in the future to other clients.

 
 -sam
 
 
 sage
 
 



Re: Hadoop and Ceph client/mds view of modification time

2012-11-27 Thread David Zafman

On Nov 27, 2012, at 1:14 PM, Sam Lang sam.l...@inktank.com wrote:

 On 11/27/2012 01:38 PM, David Zafman wrote:
 
 On Nov 27, 2012, at 11:05 AM, Sam Lang sam.l...@inktank.com wrote:
 
 On 11/27/2012 12:01 PM, Sage Weil wrote:
 On Tue, 27 Nov 2012, David Zafman wrote:
 
 On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:
 
 On Tue, 27 Nov 2012, Sam Lang wrote:
 
 3. When a client acquires the cap for a file, have the mds provide its 
 current
 time as well.  As the client updates the mtime, it uses the timestamp 
 provided
 by the mds and the time since the cap was acquired.
 Except for the skew caused by the message latency, this approach allows 
 the
 mtime to be based off the mds time, so it will be consistent across 
 clients
 and the mds.  It does however, allow a client to set an mtime to the 
 future
 (based off of its local time), which might be undesirable, but that is 
 more
 like how  NFS behaves.  Message latency probably won't be much of an 
 issue
 either, as the granularity of mtime is a second. Also, the client can 
 set its
 cap acquired timestamp to the time at which the cap was requested, 
 ensuring
 that the relative increment includes the round trip latency so that the 
 mtime
 will always be set further ahead. Of course, this approach would be a 
 lot more
 intrusive to implement. :-)
 
 Yeah, I'm less excited about this one.
 
 I think that giving consistent behavior from a single client despite 
 clock
 skew is a good goal.  That will make things like pjd's test behave
 consistently, for example.
 
 
 My suggestion is that a client writing to a file will try to use it's
 local clock unless it would cause the mtime to go backward.  In that
 case it will simply perform the minimum mtime advance possible (1
 second?).  This handles the case in which one client created a file
 using his clock (per previous suggested change), then another client
 writes with a clock that is behind.
 
 We can choose to not decrement at the client, but because mtime is a time_t 
 (seconds since epoch), we can't increment by 1 for each write. 1000 writes 
 each taking 0.01s would move the mtime 990 seconds into the future.
 
 The mtime update shouldn't work that way (see below).
 
 
 
 That's a possibility (if it's 1ms or 1ns, at least :). We need to verify
 what POSIX says about that, though: if you utimes(2) an mtime into the
 future, what happens on write(2)?
 
 On ext4, a write(2) after the mtime has been set into the future with
 utimes(2) does make the time go backward.  However, we can notice that if
 ctime == mtime, then the last thing done to the file was a
 create/write/truncate.  This means that we should not let the mtime go
 backward in that case.  If ctime != mtime, then the mtime was set by
 utimes(2), so we can set mtime using our clock even if it goes backwards.
 
 I'm not sure I follow you here.  utimes(2) can set mtime and ctime to the same
 value or to different values, or set mtime and/or ctime to the current time.
 That makes it hard to rely on the mtime != ctime conditional.

utimes(2) does not allow you to modify ctime.  As a matter of fact, if you set
mtime, ctime will always be set to the local time.  On a single system with
only a forward-moving clock, ctime can never go backwards, nor will it ever
appear to be in the future.  Also, when setting mtime to now, it is the case
that ctime == mtime.  Ceph should ensure that ctime only moves forward.
Unfortunately, it can't do that and also prevent ctime from appearing to be in
the future, but NFS doesn't either, because the ctime is always set by the NFS
server clock.
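
A sketch of the ctime side of that (again illustration only):

#include <time.h>

/* Any metadata change sets ctime from the local clock, but ctime is never
 * allowed to move backward. */
time_t next_ctime(time_t cur_ctime, time_t local_now)
{
  return (local_now > cur_ctime) ? local_now : cur_ctime;
}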

ubuntu@client:~$ date ; touch file ; stat file; sleep 60; date ; echo foo >> file ; stat file ; sleep 15; date ; touch -m -t 11281200 file ; stat file ; sleep 15 ; date ; touch -m file ; stat file
###CREATE (atime == mtime == ctime)
Tue Nov 27 13:44:23 PST 2012
  File: `file'
  Size: 4   Blocks: 8  IO Block: 4096   regular file
Device: 805h/2053d  Inode: 145126  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  ubuntu)   Gid: ( 1000/  ubuntu)
Access: 2012-11-27 13:44:23.685986005 -0800
Modify: 2012-11-27 13:44:23.685986005 -0800
Change: 2012-11-27 13:44:23.685986005 -0800
 Birth: -
###WRITE (mtime == ctime)  advanced
Tue Nov 27 13:45:23 PST 2012
  File: `file'
  Size: 8   Blocks: 8  IO Block: 4096   regular file
Device: 805h/2053d  Inode: 145126  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  ubuntu)   Gid: ( 1000/  ubuntu)
Access: 2012-11-27 13:44:23.685986005 -0800
Modify: 2012-11-27 13:45:23.701986456 -0800
Change: 2012-11-27 13:45:23.701986456 -0800
 Birth: -
###UTIMES(2) mtime in the future, ctime set to local clock
Tue Nov 27 13:45:38 PST 2012
  File: `file'
  Size: 8   Blocks: 8  IO Block: 4096   regular file
Device: 805h/2053d  Inode: 145126  Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  ubuntu)   Gid: ( 1000/  ubuntu)
Access: 2012-11-27 13:44:23.685986005 -0800
Modify: 2012-11-28 12:00:00.0

Re: getting kernel debug output

2012-10-24 Thread David Zafman

I also added a kcon_most teuthology task which does almost the same thing as
ceph/src/script/kcon_most.sh, for all clients or any subset of them.  The
teuthology version does not raise the console log level.

For example:

tasks:
- ceph:
- kclient:
- kcon_most:
- interactive:


On Oct 24, 2012, at 11:14 AM, Alex Elder el...@inktank.com wrote:

 On 10/24/2012 12:11 PM, Sage Weil wrote:
 I'm working on http://tracker.newdream.net/issues/3342 and was able to 
 reproduce the msgr bug (some annoying msgr race I think) while generating 
 full libceph debug output.  I used a teuthology yaml fragment like so:
 
 I have more trouble than that, but perhaps there's something
 weird about having my serial console connected from 1500 miles
 away.  I'm impressed full debugging didn't mess things up.
 
 tasks:
 - clock: null
 - ceph:
     log-whitelist:
     - wrongly marked me down
     - objects unfound and apparently lost
 - thrashosds: null
 - kclient: null
 - exec:
     client.0:
     - echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
 
 This is cool, I didn't know you could do this.
 
 - workunit:
     clients:
       all:
       - suites/ffsb.sh
 
 I was pleasantly surprised that even though this is putting copious 
 amounts of crap in dmesg it didn't slow things down enough to avoid 
 tripping the bug.  And the 'dmesg' command in kdb appears to be working 
 now (a couple months back it wasn't).  Yay!
 
 For me, dmesg has been working, but I'd like to know how to
 truncate the output to just, say, the last 200 lines.  (Maybe
 there is one.)
 
 Anyway, this might be useful in tracking down other bugs as well...
 
 Yes, this is good news.
 
   -Alex
