Re: new OSD re-using old OSD id fails to boot
On 12/9/15 2:39 AM, Wei-Chung Cheng wrote:

Hi Loic, I tried to reproduce this problem on my CentOS 7 machine, but I could not hit the same issue. This is my version: ceph version 10.0.0-928-g8eb0ed1 (8eb0ed1dcda9ee6180a06ee6a4415b112090c534). Would you describe it in more detail?

Hi David, Sage, Most of the time, when we find an OSD failure, the OSD is already in the `out` state. The redundant data movement cannot be avoided unless we set noout before the failure. Is that right? (That is, once an OSD goes into the `out` state, some redundant data movement follows.)

Yes, one case would be that during the 5 minute down window of an OSD disk failure, the noout flag can be set if a spare disk is available. Another scenario would be a bad SMART status or noticing EIO errors from a disk prompting a replacement. So if a spare disk is already installed or you have hot-swappable drives, it would be nice to replace the drive and let recovery copy back all the data that should be there. Using noout would be critical to this effort. I don't understand why Sage suggests below that a down+out phase would be required during the replacement.

Could we try the traditional spare behavior? (Set some disks aside as backup and automatically replace the broken device?) That could replace the failed OSD before it goes into the `out` state. Or could we always set the osd noout?

In fact, I think David and Loic are hitting two different problems. (The two problems are equally important :p) If you have any problems, feel free to let me know. thanks!! vicente

2015-12-09 10:50 GMT+08:00 Sage Weil <sw...@redhat.com>: On Tue, 8 Dec 2015, David Zafman wrote: Remember, I really think we want a disk replacement feature that retains the OSD id so that it avoids unnecessary data movement. See tracker http://tracker.ceph.com/issues/13732 Yeah, I totally agree. We just need to form an opinion on how... probably starting with the user experience.
Ideally we'd go from up + in to down + in to down + out, then pull the drive and replace it, and then initialize a new OSD with the same id... and journal partition. Something like

ceph-disk recreate id=N uuid=U

I.e., it could use the uuid (which the cluster has in the OSDMap) to find (and re-use) the journal device. For a journal failure it'd probably be different.. but maybe not? Any other ideas?

sage

-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
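The flow Sage sketches above might look like the following from an operator's shell. This is only a sketch: `ceph-disk recreate` is a proposal at this point and does not exist, osd.N / id=N / uuid=U are placeholders, and the init-system commands vary by distro.

```shell
# Sketch of the proposed replacement flow (requires a live cluster;
# "ceph-disk recreate" is hypothetical, per Sage's proposal above).
ceph osd set noout              # prevent the OSD from being marked out (no rebalance)
systemctl stop ceph-osd@N       # osd.N goes down but stays "in"
# ... physically swap the failed drive for the spare ...
ceph-disk recreate id=N uuid=U  # hypothetical: re-initialize the new disk with the
                                # old OSD id; uuid locates the journal device
systemctl start ceph-osd@N      # recovery copies back only this OSD's data
ceph osd unset noout
```

The point of re-using the id is that CRUSH placement does not change, so only the replaced disk's contents need to be backfilled.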
Re: new OSD re-using old OSD id fails to boot
Remember, I really think we want a disk replacement feature that retains the OSD id so that it avoids unnecessary data movement. See tracker http://tracker.ceph.com/issues/13732

David

On 12/5/15 8:49 AM, Loic Dachary wrote: Hi Sage, The problem described at "new OSD re-using old OSD id fails to boot" http://tracker.ceph.com/issues/13988 consistently fails the ceph-disk suite on master. I wonder if it could be a side effect of the recent optimizations introduced in the monitor? Cheers
Re: 答复: [ceph-users] How long will the logs be kept?
dout() is used for an OSD to log information about what it is doing locally and might become very chatty. It is saved on the local node's disk only. clog is the cluster log and is used for major events that should be known by the administrator (see ceph -w). clog should be used sparingly, as it sends the messages to the monitor.

David

On 12/3/15 4:36 AM, Wukongming wrote: OK! One more question. Do you know why ceph has two ways of outputting logs (dout and clog)? I find dout more helpful than clog. Did ceph use clog first, with dout added in a newer version? - wukongming ID: 12019 Tel: 0571-86760239 Dept: 2014 UIS2 ONEStor

-----Original Message-----
From: Jan Schermer [mailto:j...@schermer.cz]
Sent: December 3, 2015, 16:58
To: wukongming 12019 (RD)
Cc: huang jun; ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com
Subject: Re: [ceph-users] How long will the logs be kept?

You can set up logrotate however you want - not sure what the default is for your distro. Usually logrotate doesn't touch files that are smaller than some size even if they are old. It will also not delete logs for OSDs that no longer exist. Ceph itself has nothing to do with log rotation; logrotate does the work. Ceph packages likely contain default logrotate rules for the logs, but you can edit them to your liking. Jan

On 03 Dec 2015, at 09:38, Wukongming wrote: Yes, I can find ceph's logrotate configuration file in the directory /etc/logrotate.d. Also, I found something weird.
drwxr-xr-x  2 root root   4.0K Dec  3 14:54 ./
drwxrwxr-x 19 root syslog 4.0K Dec  3 13:33 ../
-rw------- 1 root root     0 Dec  2 06:25 ceph.audit.log
-rw------- 1 root root   85K Nov 25 09:17 ceph.audit.log.1.gz
-rw------- 1 root root  228K Dec  3 16:00 ceph.log
-rw------- 1 root root   28K Dec  3 06:23 ceph.log.1.gz
-rw------- 1 root root  374K Dec  2 06:22 ceph.log.2.gz
-rw-r--r-- 1 root root  4.3M Dec  3 16:01 ceph-mon.wkm01.log
-rw-r--r-- 1 root root  561K Dec  3 06:25 ceph-mon.wkm01.log.1.gz
-rw-r--r-- 1 root root  2.2M Dec  2 06:25 ceph-mon.wkm01.log.2.gz
-rw-r--r-- 1 root root     0 Dec  2 06:25 ceph-osd.0.log
-rw-r--r-- 1 root root   992 Dec  1 09:09 ceph-osd.0.log.1.gz
-rw-r--r-- 1 root root   19K Dec  3 10:51 ceph-osd.2.log
-rw-r--r-- 1 root root  2.3K Dec  2 10:50 ceph-osd.2.log.1.gz
-rw-r--r-- 1 root root   27K Dec  1 10:31 ceph-osd.2.log.2.gz
-rw-r--r-- 1 root root   13K Dec  3 10:23 ceph-osd.5.log
-rw-r--r-- 1 root root  1.6K Dec  2 09:57 ceph-osd.5.log.1.gz
-rw-r--r-- 1 root root   22K Dec  1 09:51 ceph-osd.5.log.2.gz
-rw-r--r-- 1 root root   19K Dec  3 10:51 ceph-osd.8.log
-rw-r--r-- 1 root root   18K Dec  2 10:50 ceph-osd.8.log.1
-rw-r--r-- 1 root root  261K Dec  1 13:54 ceph-osd.8.log.2

I deployed the ceph cluster on Nov 21, and the logs from that day through Dec 1 (ten consecutive days) were compressed into one file, which is not what I want. Does any operation affect log compression? Thanks! Kongming Wu - wukongming ID: 12019 Tel: 0571-86760239 Dept: 2014 UIS2 ONEStor

-----Original Message-----
From: huang jun [mailto:hjwsm1...@gmail.com]
Sent: December 3, 2015, 13:19
To: wukongming 12019 (RD)
Cc: ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com
Subject: Re: How long will the logs be kept?

It will rotate every week by default; you can see the logrotate file /etc/ceph/logrotate.d/ceph

2015-12-03 12:37 GMT+08:00 Wukongming: Hi, all. Does anyone know how long (how many days) the .gz logs (mon/osd/mds) are kept before being flushed?
- wukongming ID: 12019 Tel: 0571-86760239 Dept: 2014 UIS2 OneStor

This e-mail and its attachments contain confidential information from H3C, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!

-- thanks huangjun ___ ceph-users mailing list ceph-us...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
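As Jan notes above, the weekly rotation and compression behavior comes from the distro's logrotate rule for ceph, not from ceph itself. A rule of roughly this shape controls it; this is an illustrative example, and the exact stanza your packages install may differ, so check the file on your system:

```conf
# /etc/logrotate.d/ceph -- illustrative example, not the packaged file
/var/log/ceph/*.log {
    rotate 7
    weekly
    compress
    sharedscripts
    postrotate
        # signal the daemons to reopen their log files
        killall -q -1 ceph-mon ceph-osd ceph-mds || true
    endscript
    missingok
    notifempty
}
```

Changing `weekly` to `daily` (or adjusting `rotate`) in this file is how you would get per-day compressed logs instead of ten days folded into one rotation.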
Re: Error handling during recovery read
I can't remember the details now, but I know that recovery needed additional work. If it were a simple fix, I would have done it when implementing that code. I found this bug related to recovery and EC errors (http://tracker.ceph.com/issues/13493): BUG #13493: osd: for ec, cascading crash during recovery if one shard is corrupted

David

On 12/4/15 2:03 AM, Markus Blank-Burian wrote: Hi David, I am using ceph 9.2.0 with an erasure coded pool and have some problems with missing objects. Reads for degraded/backfilling objects on an EC pool which detect an error (-2 in my case) seem to be aborted immediately instead of reading from the remaining shards. Why is there an explicit check for "!rop.for_recovery" in ECBackend::handle_sub_read_reply? Would it be possible to remove this check and let the recovery read be completed from the remaining good shards? Markus
Re: OSD replacement feature
That is correct. The goal is to only refill the replacement OSD disk. Otherwise, if the OSD is down for less than mon_osd_down_out_interval (5 min default) or noout is set, no other data movement would occur.

David

On 11/23/15 8:45 PM, Wei-Chung Cheng wrote: 2015-11-21 1:54 GMT+08:00 David Zafman <dzaf...@redhat.com>: There are two reasons for having a ceph-disk replace feature. 1. To simplify the steps required to replace a disk. 2. To allow a disk to be replaced proactively without causing any data movement.

Hi David, It would be good to avoid any data movement when replacing a failed OSD. But I don't have any idea how to accomplish that; could you give some opinions? I thought that if we want to replace a failed OSD, we must copy the object data from the failed OSD to the new (replacement) OSD. Or am I misunderstanding something? thanks!!! vicente

So keeping the osd id the same is required and is what motivated the feature for me. David

On 11/20/15 3:38 AM, Sage Weil wrote: On Fri, 20 Nov 2015, Wei-Chung Cheng wrote: Hi Loic and cephers, Sure, I have time to help with (and comment on) this disk-replacement feature. This is a useful feature for handling disk failure :p A simple procedure is described at http://tracker.ceph.com/issues/13732 : 1. Set the noout flag - if the broken osd is a primary osd, can we handle that well? 2. Stop the osd daemon and wait for the osd to actually go down (or maybe use the deactivate option with ceph-disk). These two steps seem OK. As for handling the crush map: should we remove the broken osd? If we do that, why set the noout flag at all? Removing the osd from the crushmap still triggers a re-balance.

Right--I think you generally want to do either one or the other: 1) mark the osd out, and leave the failed disk in place or replace it with a new disk that re-uses the same osd id; or 2) remove the osd from the crush map, and replace it with a new disk (which gets a new osd id). I think re-using the osd id is awkward currently, so doing 1 and replacing the disk ends up moving data twice.
sage
Re: OSD replacement feature
There are two reasons for having a ceph-disk replace feature. 1. To simplify the steps required to replace a disk. 2. To allow a disk to be replaced proactively without causing any data movement.

So keeping the osd id the same is required and is what motivated the feature for me.

David

On 11/20/15 3:38 AM, Sage Weil wrote: On Fri, 20 Nov 2015, Wei-Chung Cheng wrote: Hi Loic and cephers, Sure, I have time to help with (and comment on) this disk-replacement feature. This is a useful feature for handling disk failure :p A simple procedure is described at http://tracker.ceph.com/issues/13732 : 1. Set the noout flag - if the broken osd is a primary osd, can we handle that well? 2. Stop the osd daemon and wait for the osd to actually go down (or maybe use the deactivate option with ceph-disk). These two steps seem OK. As for handling the crush map: should we remove the broken osd? If we do that, why set the noout flag at all? Removing the osd from the crushmap still triggers a re-balance.

Right--I think you generally want to do either one or the other: 1) mark the osd out, and leave the failed disk in place or replace it with a new disk that re-uses the same osd id; or 2) remove the osd from the crush map, and replace it with a new disk (which gets a new osd id). I think re-using the osd id is awkward currently, so doing 1 and replacing the disk ends up moving data twice.

sage
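Sage's two options above can be sketched as command sequences. This is only an illustration against a live cluster; osd.N is a placeholder, and option 2 is the standard full-removal sequence that retires the id:

```shell
# Option 1: keep the id. Mark the osd out (or set noout and act quickly),
# swap the disk, and bring the replacement back up as the same osd.N.
ceph osd out osd.N

# Option 2: retire the id. Remove the OSD entirely; the replacement disk
# is provisioned fresh and receives a new osd id.
ceph osd out osd.N
ceph osd crush remove osd.N
ceph auth del osd.N
ceph osd rm N
```

With option 2 the CRUSH topology changes, which is why data moves twice: once when the osd is removed, and again when the new id is placed.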
Re: pg scrub check problem
Initiating a manual deep-scrub like you are doing should always run. The command you are running doesn't report any information; it just initiates a background process. If you follow the command with ceph -w, you'll see what is happening. After I corrupted one of my replicas I see this:

$ ceph pg deep-scrub 1.6; ceph -w
instructing pg 1.6 on osd.3 to deep-scrub
    cluster 8528c83b-0ff9-479c-af76-fc0ac5c595d3
     health HEALTH_OK
     monmap e1: 1 mons at {a=127.0.0.1:6789/0} election epoch 2, quorum 0 a
     osdmap e14: 4 osds: 4 up, 4 in flags sortbitwise
      pgmap v29: 16 pgs, 2 pools, 1130 bytes data, 1 objects
            83917 MB used, 30311 MB / 117 GB avail
                  16 active+clean
2015-10-28 12:23:17.724011 mon.0 [INF] from='client.? 127.0.0.1:0/3672629479' entity='client.admin' cmd=[{"prefix": "pg deep-scrub", "pgid": "1.6"}]: dispatch
2015-10-28 12:23:19.787756 mon.0 [INF] pgmap v30: 16 pgs: 1 active+clean+inconsistent, 15 active+clean; 1130 bytes data, 83917 MB used, 30310 MB / 117 GB avail
2015-10-28 12:23:18.274239 osd.3 [INF] 1.6 deep-scrub starts
2015-10-28 12:23:18.277332 osd.3 [ERR] 1.6 shard 2: soid 1/7fc1f406/foo/head data_digest 0xe84d3cdc != known data_digest 0x74d68469 from auth shard 0, size 7 != known size 1130
2015-10-28 12:23:18.277546 osd.3 [ERR] 1.6 deep-scrub 0 missing, 1 inconsistent objects
2015-10-28 12:23:18.277549 osd.3 [ERR] 1.6 deep-scrub 1 errors
^C

David

On 10/28/15 3:34 AM, 池信泽 wrote: Are you sure the osd began to scrub? Maybe you could check the osd log, or use 'ceph pg dump' to check whether the scrub stamp changed or not, because there are some conditions that can cause the scrub command to be rejected, such as the system load, osd_scrub_min_interval, osd_deep_scrub_interval and so on.

2015-10-28 17:39 GMT+08:00 changtao381: Hi, I’m testing the deep-scrub function of ceph.
And the test steps are below:

1) I put an object on ceph using the command: rados put test.txt test.txt -p testpool. The size of testpool is 3, so there are three replicas on three osds:
osd.0: /data1/ceph_data/osd.0/current/1.0_head/test.txt__head_8B0B6108__1
osd.1: /data2/ceph_data/osd.1/current/1.0_head/test.txt__head_8B0B6108__1
osd.2: /data3/ceph_data/osd.2/current/1.0_head/test.txt__head_8B0B6108__1

2) I modified the content of one replica on osd.0 using the vim editor directly on disk.

3) I ran the command ceph pg deep-scrub 1.0 and expected it to catch the inconsistency, but it fails to find the error. Why? Any suggestions will be appreciated! Thanks
Re: pg scrub check problem
Good point. In my previous response I did "echo garbage > ./foo__head_7FC1F406__1" to corrupt a replica.

David

On 10/28/15 5:13 PM, Sage Weil wrote: Because you *just* wrote the object, and the FileStore caches open file handles. Vim renames a new inode over the old one, so the open inode is untouched. If you restart the osd and then scrub, you'll see the error.

sage
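Sage's point can be demonstrated with plain shell, independent of Ceph (a sketch; the file names are arbitrary). A process holding an open file descriptor, like the OSD's cached handle, keeps reading the inode it opened, not whatever inode the name now points to:

```shell
# vim-style replace: rename a new inode over the name
echo original > obj
exec 3< obj                 # hold an open handle, like the OSD's cached fd
echo edited > obj.tmp
mv obj.tmp obj              # the directory entry now points at a new inode
cat <&3                     # prints: original  (the old inode is untouched)
cat obj                     # prints: edited

# An in-place overwrite ("echo garbage > obj") instead truncates the same
# inode, so a cached handle sees the corruption immediately; that is why
# David's echo-redirect method trips deep-scrub without an OSD restart.
```

Restarting the OSD drops the cached handles, after which even a vim-style edit is visible to the next deep-scrub.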
Re: wip-addr
I don't understand how encode/decode of entity_addr_t can change without versioning in the encode/decode. This means that this branch changes the ceph-objectstore-tool export format if CEPH_FEATURE_MSG_ADDR2 is part of the features, so we could bump super_header::super_ver if the export format must change. Now that I look at it, I'm sure I can clear the watchers and old_watchers in object_info_t during export, because that is dynamic information and it happens to include entity_addr_t. I need to verify this, but that may be the only reason the objectstore tool needs a valid features value to be passed there.

David

On 10/9/15 2:49 PM, Sage Weil wrote:
> 2. (about line 2067 in src/tools/ceph_objectstore_tool.cc) (use via ceph cmd?) tools - "object store tool". This has a way to serialize objects which includes a watch list which includes an address. There should be an option here to say whether to include exported addresses.
I think it's safe to use defaults here.. what do you think, David?
Re: [ceph-users] O_DIRECT on deep-scrub read
There would be a benefit to doing an fadvise POSIX_FADV_DONTNEED after deep-scrub reads for objects not recently accessed by clients. I see the NewStore objectstore sometimes using the O_DIRECT flag for writes. This concerns me because the open(2) man page says: "Applications should avoid mixing O_DIRECT and normal I/O to the same file, and especially to overlapping byte regions in the same file. Even when the filesystem correctly handles the coherency issues in this situation, overall I/O throughput is likely to be slower than using either mode alone."

David

On 10/7/15 7:50 AM, Sage Weil wrote: It's not, but it would not be hard to do this. There are fadvise-style hints being passed down that could trigger O_DIRECT reads in this case. That may not be the best choice, though--it won't use data that happens to be in cache and it'll also throw it out..

On Wed, 7 Oct 2015, Paweł Sadowski wrote: Hi, Can anyone tell if deep scrub is done using the O_DIRECT flag or not? I'm not able to verify that in the source code. If not, would it be possible to add such a feature (maybe a config option) to help keep the Linux page cache in better shape? Thanks, -- PS ___ ceph-users mailing list ceph-us...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
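The difference David is pointing at can be seen from the shell with GNU dd (a sketch; the file path is a placeholder). `iflag=nocache` issues posix_fadvise(POSIX_FADV_DONTNEED) on the bytes it reads, which is the hint being discussed: serve the read from cache if present, then tell the kernel the pages are not needed again. `iflag=direct` (O_DIRECT) instead bypasses the cache entirely, which is why it neither benefits from nor preserves cached data:

```shell
# fadvise-style: read normally, then drop the pages from the page cache;
# data already cached is still used to serve the read
dd if=/var/lib/ceph/osd/ceph-0/current/someobj of=/dev/null bs=1M iflag=nocache

# O_DIRECT-style: bypass the page cache on the read path entirely
dd if=/var/lib/ceph/osd/ceph-0/current/someobj of=/dev/null bs=1M iflag=direct
```

For a scrub-like workload the nocache variant keeps client-hot data cached while preventing scrub reads from evicting it.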
Tracker 12577 repair won't fix replica with bad digest
Sage, I restored the branch wip-digest-repair, which merged post-hammer in pull request #4365. Do you think that 4365 fixes the reported bug #12577? I cherry-picked the 9 commits off of hammer-backports-next as pull request #5458 and assigned it to Loic. David
Re: ceph-objectstore-tool import failures
I'm going to skip exporting of temp objects in a new wip-temp-zafman branch. Also, when we have persistent-temp objects, we'll probably need to enhance object_locator_to_pg() to adjust for negative pool numbers.

David

On 7/7/15 10:34 AM, Samuel Just wrote: In the sense that the osd will still clear them, sure. I've changed my mind though; probably best not to import or export them for now, and to update the code to handle the persistent-temp objects when they exist (by looking at the hash). We don't record anything about the in-progress push, so the recovery temp objects at least aren't valuable to keep around. -Sam

- Original Message - From: Sage Weil sw...@redhat.com To: Samuel Just sj...@redhat.com Cc: David Zafman dzaf...@redhat.com, ceph-devel@vger.kernel.org Sent: Tuesday, July 7, 2015 10:22:32 AM Subject: Re: ceph-objectstore-tool import failures

On Tue, 7 Jul 2015, Samuel Just wrote: If we think we'll want to persist some temp objects later on, probably better to go ahead and export/import them now. Replay isn't relevant here since it happens at a lower level. The ceph_objectstore_tool does do a kind of split during import, since it needs to be able to handle the case where the pg was split between the export and the import. In the event that temp objects need to persist across intervals, we'll have to solve the problem of splitting the temp objects in the osd as well as in the objectstore tool -- probably by creating a class of persistent temp objects with non-fake hashes taken from the corresponding non-temp object.

Yeah.. I suspect the right thing to do is make the temp object hash match the eventual target hash. We can do this now for the temp recovery objects (even though they'll be deleted by the OSD). Presumably the same trick will work for recorded transaction objects too, or whatever else... In any case, for now the cot split can just look at the hash like it does with the non-temp objects and we're good, right?
sage

-Sam

- Original Message - From: Sage Weil sw...@redhat.com To: David Zafman dzaf...@redhat.com Cc: sj...@redhat.com, ceph-devel@vger.kernel.org Sent: Tuesday, July 7, 2015 10:00:09 AM Subject: Re: ceph-objectstore-tool import failures

On Mon, 6 Jul 2015, David Zafman wrote: Why import temp objects when clear_temp_objects() will just remove them on osd start-up?

For now we could get away with skipping them, but I suspect in the future there will be cases where we want to preserve them across restarts (for example, when recording multi-object transactions that are not yet committed).

If we need the temp objects for replay purposes, does it matter if a split has occurred after the original export happened?

The replay should happen before the export... it's below the ObjectStore interface, so I don't think it matters here. I'm not sure about the split implications, though. Does the export/import have to do a split, or does it let the OSD do that after it's imported?

sage

Or can we just import all temporary objects without regard to split and assume that after replay clear_temp_objects() will clean them up?

David

On 7/6/15 1:28 PM, Sage Weil wrote: On Fri, 19 Jun 2015, David Zafman wrote: This ghobject_t, which has a pool of -3, is part of the export. This caused the assert: Read -3/1c/temp_recovering_1.1c_33'50_39_head/head This was added by "osd: use per-pool temp poolid for temp objects" 18eb2a5fea9b0af74a171c3717d1c91766b15f0c in your branch. You should skip it on export or recreate it on import with special handling.

Ah, that makes sense. I think we should include these temp objects in the export, though, and make cot understand that they are part of the pool. We moved the clear-temp-objects-on-startup logic into the OSD, which I think will be useful for e.g. multi-object transactions (where we'll want some objects that are internal/hidden to persist across peering intervals and restarts).
Looking at your wip-temp-zafman, I think the first patch needs to be dropped: include the temp objects, and I assume the meta one (which has the pg log and other critical pg metadata). Not sure where to change cot to handle the temp objects though? Thanks!

sage

David

On 6/19/15 7:38 PM, David Zafman wrote: Have not seen this as an assert before. Given the code below in do_import() of the master branch, the assert is impossible (?).

    if (!curmap.have_pg_pool(pgid.pgid.m_pool)) {
      cerr << "Pool " << pgid.pgid.m_pool << " no longer exists" << std::endl;
      // Special exit code for this error, used by test code
      return 10;  // Positive return means exit status
    }

David

On 6/19/15 7:25 PM, Sage Weil wrote: Hey David, On this run /a/sage-2015-06-18_15:51:18-rados-wip-temp---basic-multi/939648 ceph-objectstore-tool is failing to import a pg because the pool doesn't exist. It looks like the thrasher is doing an export+import and racing with a test that is tearing down a pool. The crash is ceph version
Re: ceph-objectstore-tool import failures
Why import temp objects when clear_temp_objects() will just remove them on osd start-up? If we need the temp objects for replay purposes, does it matter if a split has occurred after the original export happened? Or can we just import all temporary objects without regard to split and assume that after replay clear_temp_objects() will clean them up?

David

On 7/6/15 1:28 PM, Sage Weil wrote: On Fri, 19 Jun 2015, David Zafman wrote: This ghobject_t, which has a pool of -3, is part of the export. This caused the assert: Read -3/1c/temp_recovering_1.1c_33'50_39_head/head This was added by "osd: use per-pool temp poolid for temp objects" 18eb2a5fea9b0af74a171c3717d1c91766b15f0c in your branch. You should skip it on export or recreate it on import with special handling.

Ah, that makes sense. I think we should include these temp objects in the export, though, and make cot understand that they are part of the pool. We moved the clear-temp-objects-on-startup logic into the OSD, which I think will be useful for e.g. multi-object transactions (where we'll want some objects that are internal/hidden to persist across peering intervals and restarts). Looking at your wip-temp-zafman, I think the first patch needs to be dropped: include the temp objects, and I assume the meta one (which has the pg log and other critical pg metadata). Not sure where to change cot to handle the temp objects though? Thanks!

sage

David

On 6/19/15 7:38 PM, David Zafman wrote: Have not seen this as an assert before. Given the code below in do_import() of the master branch, the assert is impossible (?).

    if (!curmap.have_pg_pool(pgid.pgid.m_pool)) {
      cerr << "Pool " << pgid.pgid.m_pool << " no longer exists" << std::endl;
      // Special exit code for this error, used by test code
      return 10;  // Positive return means exit status
    }

David

On 6/19/15 7:25 PM, Sage Weil wrote: Hey David, On this run /a/sage-2015-06-18_15:51:18-rados-wip-temp---basic-multi/939648 ceph-objectstore-tool is failing to import a pg because the pool doesn't exist.
It looks like the thrasher is doing an export+import and racing with a test that is tearing down a pool. The crash is:

ceph version 9.0.1-955-ge274efa (e274efa450e99a68c02bcb713c8837d7809f1ec3)
 1: ceph-objectstore-tool() [0xa26335]
 2: (()+0xfcb0) [0x7f10cef18cb0]
 3: (gsignal()+0x35) [0x7f10cd5af425]
 4: (abort()+0x17b) [0x7f10cd5b2b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f10cdf0269d]
 6: (()+0xb5846) [0x7f10cdf00846]
 7: (()+0xb5873) [0x7f10cdf00873]
 8: (()+0xb596e) [0x7f10cdf0096e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xb0ce09]
 10: (ObjectStoreTool::get_object(ObjectStore*, coll_t, ceph::buffer::list, OSDMap, bool*)+0x143f) [0x64829f]
 11: (ObjectStoreTool::do_import(ObjectStore*, OSDSuperblock, bool, std::string)+0x13dd) [0x64a62d]
 12: (main()+0x3017) [0x632037]
 13: (__libc_start_main()+0xed) [0x7f10cd59a76d]
 14: ceph-objectstore-tool() [0x639119]

I don't think this is related to my branch.. but maybe? Have you seen this? I rebased onto latest master yesterday.

sage
Re: deleting objects from a pool
This is a dangerous command because it can remove all of your objects. At least it can only do one namespace at a time. It was intended to clean up rados bench runs, and it is dangerous because it doesn't require extra hoops the way rados rmpool does. I'm tempted to disallow usage like this with empty --prefix/--run-name arguments.

David

On 6/25/15 10:40 PM, Podoski, Igor wrote: Hi David, You're right, now I see that adding --run-name will clean all benchmark data from the specified namespace, so you can run the command only once. rados -p poolname -N namespace cleanup --prefix --run-name Regards, Igor.

-Original Message- From: David Zafman [mailto:dzaf...@redhat.com] Sent: Friday, June 26, 2015 3:46 AM To: Podoski, Igor; Deneau, Tom; Dałek, Piotr; ceph-devel Subject: Re: deleting objects from a pool

If you have rados bench data around, you'll need to run cleanup a second time, because the first time the benchmark_last_metadata object will be consulted to find which objects to remove. Also, using cleanup this way will only remove objects from the default namespace unless a namespace is specified with the -N option. rados -p poolname -N namespace cleanup --prefix

David

On 6/24/15 11:06 PM, Podoski, Igor wrote: Hi, It appears that cleanup can be used as a purge: rados -p poolname cleanup --prefix Regards, Igor.

-Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Deneau, Tom Sent: Wednesday, June 24, 2015 10:22 PM To: Dałek, Piotr; ceph-devel Subject: RE: deleting objects from a pool

I've noticed that deleting objects from a basic k=2 m=1 erasure pool is much, much slower than deleting a similar number of objects from a replicated size 3 pool (so the same number of files to be deleted). It looked like the ec pool object deletion was almost 20x slower. Is there a lot more work to be done to delete an ec pool object?
-- Tom

-Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of Dałek, Piotr Sent: Wednesday, June 24, 2015 11:56 AM To: ceph-devel Subject: Re: deleting objects from a pool

-Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of Deneau, Tom Sent: Wednesday, June 24, 2015 6:44 PM

I have benchmarking situations where I want to leave a pool around but delete a lot of objects from the pool. Is there any really fast way to do that? I noticed rados rmpool is fast, but I don't want to remove the pool. I have been spawning multiple threads, each deleting a subset of the objects (which I believe is what rados bench write does), but even that can be very slow.

For now, apart from rados -p poolname cleanup (which doesn't purge the pool, but merely removes objects written during the last benchmark run), the only option is brute force: for i in $(rados -p poolname ls); do rados -p poolname rm $i > /dev/null; done; There's no purge-pool command in rados -- not yet, at least. I was thinking about one, but never really had time to implement it.
With best regards / Pozdrawiam Piotr Dałek
Re: deleting objects from a pool
If you have rados bench data around, you'll need to run cleanup a second time, because the first time the benchmark_last_metadata object will be consulted to find what objects to remove. Also, using cleanup this way will only remove objects from the default namespace unless a namespace is specified with the -N option: rados -p poolname -N namespace cleanup --prefix David On 6/24/15 11:06 PM, Podoski, Igor wrote: Hi, It appears that cleanup can be used as a purge: rados -p poolname cleanup --prefix Regards, Igor. -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Deneau, Tom Sent: Wednesday, June 24, 2015 10:22 PM To: Dałek, Piotr; ceph-devel Subject: RE: deleting objects from a pool I've noticed that deleting objects from a basic k=2 m=1 erasure pool is much, much slower than deleting a similar number of objects from a replicated size 3 pool (so the same number of files to be deleted). It looked like the ec pool object deletion was almost 20x slower. Is there a lot more work to be done to delete an ec pool object? -- Tom -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Dałek, Piotr Sent: Wednesday, June 24, 2015 11:56 AM To: ceph-devel Subject: Re: deleting objects from a pool -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Deneau, Tom Sent: Wednesday, June 24, 2015 6:44 PM I have benchmarking situations where I want to leave a pool around but delete a lot of objects from the pool. Is there any really fast way to do that? I noticed rados rmpool is fast, but I don't want to remove the pool. I have been spawning multiple threads, each deleting a subset of the objects (which I believe is what rados bench write does), but even that can be very slow.
For now, apart from rados -p poolname cleanup (which doesn't purge the pool, but merely removes objects written during the last benchmark run), the only option is brute force: for i in $(rados -p poolname ls); do rados -p poolname rm $i > /dev/null; done There's no purge pool command in rados -- not yet, at least. I was thinking about one, but never really had time to implement one. With best regards / Pozdrawiam Piotr Dałek
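The prefix-scoped behavior of cleanup that David and Igor describe amounts to filtering a pool listing by name prefix before deleting. A minimal sketch of that filtering logic (pure Python, no librados; function name is made up for illustration):

```python
def cleanup_by_prefix(objects, prefix):
    # Split a pool listing into benchmark objects to remove (those whose
    # names carry the benchmark prefix) and unrelated objects to keep,
    # the way `rados cleanup --prefix` narrows what it deletes.
    doomed = [o for o in objects if o.startswith(prefix)]
    kept = [o for o in objects if not o.startswith(prefix)]
    return doomed, kept
```

An empty prefix matches everything, which is why cleanup with a prefix can double as a crude purge of a namespace.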
Re: ceph-objectstore-tool import failures
Have not seen this as an assert before. Given the code below in do_import() of the master branch, the assert is impossible (?).

if (!curmap.have_pg_pool(pgid.pgid.m_pool)) {
  cerr << "Pool " << pgid.pgid.m_pool << " no longer exists" << std::endl;
  // Special exit code for this error, used by test code
  return 10;  // Positive return means exit status
}

David

On 6/19/15 7:25 PM, Sage Weil wrote: Hey David, On this run /a/sage-2015-06-18_15:51:18-rados-wip-temp---basic-multi/939648 ceph-objectstore-tool is failing to import a pg because the pool doesn't exist. It looks like the thrasher is doing an export+import and racing with a test that is tearing down a pool. The crash is:

ceph version 9.0.1-955-ge274efa (e274efa450e99a68c02bcb713c8837d7809f1ec3)
 1: ceph-objectstore-tool() [0xa26335]
 2: (()+0xfcb0) [0x7f10cef18cb0]
 3: (gsignal()+0x35) [0x7f10cd5af425]
 4: (abort()+0x17b) [0x7f10cd5b2b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f10cdf0269d]
 6: (()+0xb5846) [0x7f10cdf00846]
 7: (()+0xb5873) [0x7f10cdf00873]
 8: (()+0xb596e) [0x7f10cdf0096e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xb0ce09]
 10: (ObjectStoreTool::get_object(ObjectStore*, coll_t, ceph::buffer::list, OSDMap, bool*)+0x143f) [0x64829f]
 11: (ObjectStoreTool::do_import(ObjectStore*, OSDSuperblock, bool, std::string)+0x13dd) [0x64a62d]
 12: (main()+0x3017) [0x632037]
 13: (__libc_start_main()+0xed) [0x7f10cd59a76d]
 14: ceph-objectstore-tool() [0x639119]

I don't think this is related to my branch.. but maybe? Have you seen this? I rebased onto latest master yesterday. sage
Re: ceph-objectstore-tool import failures
This ghobject_t, which has a pool of -3, is part of the export. This caused the assert: Read -3/1c/temp_recovering_1.1c_33'50_39_head/head This was added by "osd: use per-pool temp poolid for temp objects" (18eb2a5fea9b0af74a171c3717d1c91766b15f0c) in your branch. You should skip it on export, or recreate it on import with special handling. David On 6/19/15 7:38 PM, David Zafman wrote: Have not seen this as an assert before. Given the code below in do_import() of the master branch, the assert is impossible (?). if (!curmap.have_pg_pool(pgid.pgid.m_pool)) { cerr << "Pool " << pgid.pgid.m_pool << " no longer exists" << std::endl; // Special exit code for this error, used by test code return 10; // Positive return means exit status } David On 6/19/15 7:25 PM, Sage Weil wrote: Hey David, On this run /a/sage-2015-06-18_15:51:18-rados-wip-temp---basic-multi/939648 ceph-objectstore-tool is failing to import a pg because the pool doesn't exist. It looks like the thrasher is doing an export+import and racing with a test that is tearing down a pool. The crash is ceph version 9.0.1-955-ge274efa (e274efa450e99a68c02bcb713c8837d7809f1ec3) 1: ceph-objectstore-tool() [0xa26335] 2: (()+0xfcb0) [0x7f10cef18cb0] 3: (gsignal()+0x35) [0x7f10cd5af425] 4: (abort()+0x17b) [0x7f10cd5b2b8b] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f10cdf0269d] 6: (()+0xb5846) [0x7f10cdf00846] 7: (()+0xb5873) [0x7f10cdf00873] 8: (()+0xb596e) [0x7f10cdf0096e] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x259) [0xb0ce09] 10: (ObjectStoreTool::get_object(ObjectStore*, coll_t, ceph::buffer::list, OSDMap, bool*)+0x143f) [0x64829f] 11: (ObjectStoreTool::do_import(ObjectStore*, OSDSuperblock, bool, std::string)+0x13dd) [0x64a62d] 12: (main()+0x3017) [0x632037] 13: (__libc_start_main()+0xed) [0x7f10cd59a76d] 14: ceph-objectstore-tool() [0x639119] I don't think this is related to my branch.. but maybe? Have you seen this? I rebased onto latest master yesterday.
sage
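David's suggestion to "skip it on export" boils down to filtering out OSD-internal temp objects, which live under a negative pool id (the failing read above shows pool -3), before writing the export. A toy sketch of that filtering, assuming objects are represented as (pool_id, name) pairs (the function names are hypothetical, not ceph-objectstore-tool code):

```python
def is_temp(pool_id):
    # Per-pool temp objects sit under a negative pool id; they are
    # transient recovery state, not user data.
    return pool_id < 0

def exportable(objects):
    # Drop temp objects so an import on another OSD never sees a pool id
    # that the current OSDMap cannot resolve.
    return [(p, n) for p, n in objects if not is_temp(p)]
```

The alternative David mentions, recreating the temp object on import with special handling, would instead tag these entries and regenerate them on the destination rather than dropping them.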
rsyslogd
Greg, Have you changed anything (log rotation related?) that would uninstall rsyslog or cause it to not be able to start? I'm sometimes seeing machines fail with this error, probably in teuthology/nuke.py reset_syslog_dir(): CommandFailedError: Command failed on plana94 with status 1: 'sudo rm -f -- /etc/rsyslog.d/80-cephtest.conf && sudo service rsyslog restart' David
Re: 'Racing read got wrong version' during proxy write testing
I wonder if this issue could be the cause of #11511. Could a proxy write have raced with fill_in_copy_get() so that the object_info_t size doesn't correspond with the size of the object in the filestore? David On 6/3/15 6:22 PM, Wang, Zhiqiang wrote: Making the 'copy get' op a cache op seems like a good idea. -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Thursday, June 4, 2015 9:14 AM To: Wang, Zhiqiang Cc: ceph-devel@vger.kernel.org Subject: RE: 'Racing read got wrong version' during proxy write testing On Wed, 3 Jun 2015, Wang, Zhiqiang wrote: I ran into the 'op not idempotent' problem during the testing today. There is one bug in the previous fix. In that fix, we copy the reqids in the final step of fill_in_copy_get(). If the object is deleted, since the 'copy get' op is a read op, it returns earlier with ENOENT in do_op. No reqids will be copied during promotion in this case. This again leads to the 'op not idempotent' problem. We need a 'smart' way to detect that the op is a 'copy get' op (looping over the ops vector doesn't seem smart?) and copy the reqids in this case. Hmm. I think the idea here is/was that the ENOENT would somehow include the reqid list from PGLog::get_object_reqids(). I think the trick is getting it past the generic check in do_op:

if (!op->may_write() &&
    !op->may_cache() &&
    (!obc->obs.exists ||
     ((m->get_snapid() != CEPH_SNAPDIR) &&
      obc->obs.oi.is_whiteout()))) {
  reply_ctx(ctx, -ENOENT);
  return;
}

Maybe we mark these as cache operations so that may_cache is true? Sam, what do you think? sage -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Tuesday, May 26, 2015 12:27 AM To: Wang, Zhiqiang Cc: ceph-devel@vger.kernel.org Subject: Re: 'Racing read got wrong version' during proxy write testing On Mon, 25 May 2015, Wang, Zhiqiang wrote: Hi all, I ran into a problem during the teuthology test of proxy write.
It is like this:
- Client sends 3 writes and a read on the same object to the base tier
- Cache tiering is set up
- Client retries ops and sends the 3 writes and 1 read to the cache tier
- The 3 writes finished on the base tier, say with versions v1, v2 and v3
- Cache tier proxies the 1st write, and starts to promote the object for the 2nd write; the 2nd and 3rd writes and the read are blocked
- The proxied 1st write finishes on the base tier with version v4 and returns to the cache tier. But somehow the cache tier fails to send the reply due to socket failure injection
- Client retries the writes and the read again; the writes are identified as dup ops
- The promotion finishes; it copies the pg_log entries from the base tier and puts them in the cache tier's pg_log. This includes the 3 writes on the base tier and the proxied write
- The writes dispatch after the promotion; they are identified as completed dup ops. The cache tier replies to these write ops with the versions from the base tier (v1, v2 and v3)
- Finally, the read dispatches; it reads the version of the proxied write (v4) and replies to the client
- Client complains 'racing read got wrong version'

In a previous discussion of the 'ops not idempotent' problem, we solved it by copying the pg_log entries in the base tier to the cache tier during promotion. Seems like there is still a problem with this approach in the above scenario. My first thought is that when proxying the write, the cache tier should use the original reqid from the client. But currently we don't have a way to pass the original reqid from cache to base. Any ideas? I agree--I think the correct fix here is to make the proxied op be recognized as a dup. We can either do that by passing an optional reqid to the Objecter, or by extending the op somehow so that both reqids are listed. I think the first option will be cleaner, but I think we will also need to make sure the 'retry' count is preserved, as (I think) we skip the dup check if retry==0.
And we probably want to preserve the behavior that a given (reqid, retry) only exists once in the system. This probably means adding more optional args to Objecter::read()...? sage
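The idempotency property this thread keeps circling back to can be shown with a toy completed-request table: a retried op must be answered with the version recorded by its first execution, which is exactly what breaks when the reqid is lost across a proxy or promotion. This is a simplified sketch, not the actual pg_log implementation:

```python
class DupCheck:
    """Toy completed-request table keyed by a (client, tid) reqid."""

    def __init__(self):
        self.completed = {}  # reqid -> version replied with the first time

    def apply_write(self, reqid, new_version):
        # A retry of an already-completed request must get back the version
        # from its original execution, never a fresh one. If the reqid was
        # dropped (e.g. the proxied write used a different reqid), the retry
        # is wrongly treated as new and versions diverge.
        if reqid in self.completed:
            return self.completed[reqid], True   # dup: replay the old reply
        self.completed[reqid] = new_version
        return new_version, False                # first execution
```

In the failure scenario above, the proxied write effectively executed under a reqid the client never retried with, so the read observed v4 while the dup replies carried v1..v3.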
Re: should we prepare to release firefly v0.80.10 ?
In early March I ran rados:thrash on the firefly backport of the ceph-objectstore-tool changes (wip-cot-firefly). We considered it passed, even though an obscure segfault was seen: bug #11141: Segmentation Violation: ceph-objectstore-tool doing --op list-pgs David On 4/21/15 8:52 AM, Sage Weil wrote: The bulk of it is ceph-objectstore-tool, which is important to get into a release, IMO. David, are these being tested in the firefly thrashing tests yet? The only other one I'm worried about is 6fd3dfa osd: do not ignore deleted pgs on startup Sam, I assume the recent hammer upgrade issue would bite firefly folks who upgrade too? sage On Tue, 21 Apr 2015, Loic Dachary wrote: Hi Sage, The firefly branch has a number of fixes ( http://tracker.ceph.com/issues/11090#Release-information ) and has been used for upgrade tests in the past few weeks. A few other issues have been backported since and are being tested in the integration branch ( http://tracker.ceph.com/issues/11090#teuthology-run-commitb91bbb434e6363a99a632cf3841f70f1f2549f79-integration-branch-april-2015 ). Do you think these changes deserve a firefly v0.80.10 release? Should we ask each lead for their approval? Or is it better to keep backporting what needs to be and wait a few weeks? Cheers -- Loïc Dachary, Artisan Logiciel Libre
Re: regenerating man pages
I found that I could not build the docs on Ubuntu 14.10 with the proper packages installed. Kefu is looking into Asphyxiate, which is very temperamental. I installed Ubuntu 11.10 in a VM in order to generate the docs. David On 3/17/15 10:11 AM, Sage Weil wrote: On Tue, 17 Mar 2015, Josh Durgin wrote: On 03/17/2015 09:40 AM, Ken Dreyer wrote: I had a question about the way that we're handling man pages. In 356a749f63181d401d16371446bb8dc4f196c2a6, "rbd: regenerate rbd(8) man page", it looks like man/rbd.8 was regenerated from doc/man/8/rbd.rst. It seems like it would be more efficient to avoid storing man pages in Git and generate them dynamically at build time instead? Yes, that'd be great! https://github.com/ceph/ceph/blob/master/admin/manpage-howto.txt admin/build-doc does a lot of things (including man page generation). Could we simply run the sphinx-build -b man part at build time as a part of make? I don't see a reason not to. It's just a matter of making it work on all the platforms we're building packages for. That might be annoying for the entirety of build-doc, but for just building man pages it should be simple. I think the original reason we didn't was just because there are a lot of dependencies for building the docs, so this inflates Build-Depends. That doesn't particularly bother me, though, if the deps do in fact exist. sage
Hammer incompat bits and ceph-objectstore-tool
During upgrade testing, an error occurred because ceph-objectstore-tool, during an import on a Firefly node, found the compat_features from an export made on Hammer. There are 2 new feature bits set, as shown in the error message: Export has incompatible features set compat={},rocompat={},incompat={12=transaction hints,13=pg meta object} In this case, as far as I can tell, these OSD-incompatible changes wouldn't make the export data incompatible in any way. So we may have to check compatibility bits on a case-by-case basis if we want to allow the tool to work in as many cases as possible. During upgrade testing it is interesting that one node has the transaction hints feature, but other nodes still running firefly don't. Is this a case where we don't have to wait for all OSDs to update before the cluster can start handling OP_COLL_HINT operations? David Zafman
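The case-by-case check David proposes amounts to maintaining an allowlist of incompat bits that an older tool can safely ignore, and rejecting only bits that are both unsupported and unknown. A sketch under that assumption (the allowlist and function are hypothetical, not existing ceph-objectstore-tool code; the bit numbers come from the error message above):

```python
# Incompat bits named in the error message above.
TRANSACTION_HINTS, PG_META_OBJECT = 12, 13

# Hypothetical allowlist: bits that do not actually change the export
# format, so an older tool may ignore them.
HARMLESS = {TRANSACTION_HINTS, PG_META_OBJECT}

def import_allowed(export_incompat, tool_incompat):
    # Reject only bits this tool neither supports nor knows to be harmless:
    # a per-bit decision instead of an all-or-nothing compat check.
    return not (set(export_incompat) - set(tool_incompat) - HARMLESS)
```

The blanket check that produced the error treats any unknown incompat bit as fatal; the per-bit version fails only when a genuinely format-changing bit shows up.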
Building documentation
I was having trouble building man pages on my Ubuntu 14.04 build machine, so I looked at gitbuilder-doc. I saw that it was running Ubuntu 11.10. Even though the end-of-life for Ubuntu 11.10 was May 9, 2013, I installed a new virtual machine with it. I needed to change /etc/apt/sources.list to use old-releases.ubuntu.com to install additional packages. Just like on gitbuilder-doc, the admin/build-doc command runs without errors. I assume other distributions with more up-to-date packages will see the same problem. I filed bug #11077 with the sphinx log attached. David Zafman
Clocks out of sync
On 2 of my rados thrash runs the clocks were out of sync. Is this an occasional issue or did we have an infrastructure problem? On burnupi19 and burnupi25: 2015-02-20 12:52:52.636017 mon.1 10.214.134.14:6789/0 177 : cluster [WRN] message from mon.0 was stamped 0.501458s in the future, clocks not synchronized On plana62 and plana64: 2015-02-20 10:00:56.842533 mon.0 10.214.132.14:6789/0 3 : cluster [WRN] message from mon.1 was stamped 0.855106s in the future, clocks not synchronized
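The warning in those logs fires when a peer's message timestamp drifts beyond the monitor's allowed skew. A rough sketch of that comparison (the 0.05s threshold is an assumption mirroring the usual mon_clock_drift_allowed default; check your ceph.conf, and this is not the actual mon code):

```python
def clock_skew_warning(local_ts, remote_ts, drift_allowed=0.05):
    # Flag a peer whose message is stamped further from our clock than the
    # allowed drift. drift_allowed=0.05 is an assumed default, not gospel.
    skew = remote_ts - local_ts
    if abs(skew) > drift_allowed:
        return ("message stamped %.6fs in the future, "
                "clocks not synchronized" % skew)
    return None
```

With skews of 0.5s and 0.85s as seen above, NTP on those hosts was either not running or had not yet converged, which usually points at infrastructure rather than Ceph.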
Disk failing plana74
A recent test run had an EIO on the following disk: plana74 /dev/sdb The machine is locked right now. David Zafman Senior Developer
sage-2015-02-15_07:44:23-rados-hammer-distro-basic-multi failures
There were 24 failures before the run was killed.

758289 757223: FAILED assert(weak_refs.empty()), saw valgrind issues. Filed bug #10901
757405: failed to become clean before timeout expired; osd.4 killed and never restarted, Thrasher may have died. Filed bug #10902
758034: mira038 disk I/O error, ceph-1 on /dev/sdf. Sandon is aware
757087 757162 758292 758300 758071: already fixed bug #10784 (watch timeout)
757177 757385 757506: No JSON object could be decoded. bug #10630
757185: osd/ReplicatedPG.cc: 12991: FAILED assert(obc). bug #10820 (testing)
757431 757365: infrastructure - could not read lock status
757601 75 757952: infrastructure - too many values to unpack (immediately after locking machines)
757895: FAILED assert(0 == "racing read got wrong version"): already fixed bug #10830
758070 758244 757075 757251 757426: infrastructure? Immediate osd crash: ERROR: osd init failed: (1) Operation not permitted

David Zafman
LTTNG
On Ubuntu 12.04.1 LTS, after doing an install-deps.sh and the new do_autogen.sh without -L, I get a configure error with this in the config.log:

configure:22637: checking if lttng-gen-tp is sane
configure:22647: result: no
configure:22681: checking lttng/tracepoint.h usability
configure:22681: gcc -c -g -Wextra -Wno-missing-field-initializers -Wno-missing-declarations -Wno-unused-parameter conftest.c >&5
configure:22681: $? = 0
configure:22681: result: yes
configure:22681: checking lttng/tracepoint.h presence
configure:22681: gcc -E conftest.c
configure:22681: $? = 0
configure:22681: result: yes
configure:22681: checking for lttng/tracepoint.h
configure:22681: result: yes
configure:22692: checking for lttng-gen-tp
configure:22708: found /usr/bin/lttng-gen-tp
configure:22719: result: yes
configure:22737: error: in `/home/dzafman/ceph2':
configure:22739: error: lttng-gen-tp does not behave properly

David Zafman Senior Developer http://www.redhat.com
Re: 'Immutable bit' on pools to prevent deletion
The most secure way would be one in which you can only create pools with WORM set and can't ever change the WORM state of a pool. I like this simple/secure approach as a first cut. David On 1/17/15 11:09 AM, Alex Elsayed wrote: Sage Weil wrote: On Fri, 16 Jan 2015, Alex Elsayed wrote: Wido den Hollander wrote: [snip] Is it a sane thing to look at 'features' which pools could have? Other features which might be set on a pool: - Read Only (all write operations return -EPERM) - Delete Protected There's another pool feature I'd find very useful: a WORM flag, that permits only create and append (at the RADOS level, not the RBD level as was an Emperor blueprint). In particular, I'd _love_ being able to make something that takes Postgres WAL logs and puts them in such a pool, providing real guarantees re: consistency. Similarly, audit logs and such for compliance. How would you want this to work? - If the bit is set, object creates are allowed, but not deletes? What about append? - Are you allowed to clear the bit with something like 'ceph osd pool set pool worm false'? I'd say that a WORM pool would allow 'create' and 'append' only - that fits well with the classic notions of WORM media, and would allow natural implementations of virtualized WORM tape libraries and such for people who need compatibility (if only object creation were supported, you'd get issues where either you have absurdly tiny objects or risk data loss on failure to write a larger buffered chunk). Similarly, audit records (like selinux logs, that might be written by log daemons) don't really come in nice object-sized chunks. You want to always have the newest in the archive, so you really don't want to buffer up to that. I'd also figure that WORM's main purpose is assurance/compliance - you want to _know_ that nobody could have turned the bit off, futzed with the data, and then turned it back on. Otherwise, you'd just write your clients to only use create/append, and have no need for WORM at the pool level.
Because of that, if the flag can be cleared via commands, it should be possible for the admins to forbid it (by flatly denying it in config, via some keying system, via the bits being a ratchet, whatever - I'm not especially concerned by how the guarantee is provided, so long as it can be). Setting it should probably also be privileged, since it'd be trivial to cause a DoS by setting it on (say) a CephFS pool - although handling that concern is likely out of scope for now, since there are easier ways to ruin someone's day at the RADOS level.
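The create-and-append-only semantics the thread converges on can be made concrete with a small model. This is a toy sketch of the proposed behavior, not Ceph code; the class and method names are made up, and the one-way (ratchet) flag reflects David's "can't ever change the WORM state" preference:

```python
class WormPool:
    """Sketch: a pool whose WORM flag permits create and append only."""

    def __init__(self):
        self.objects = {}
        self.worm = True  # set at creation; deliberately no way to clear it

    def write(self, name, data, offset=0):
        if name not in self.objects:
            if offset == 0:
                self.objects[name] = data           # create: allowed
                return 0
            return -1                               # sparse create: refused
        if offset == len(self.objects[name]):
            self.objects[name] += data              # pure append: allowed
            return 0
        return -1                                   # overwrite: EPERM-like

    def delete(self, name):
        return -1                                   # deletes always refused
```

Modeling append as "offset equals current length" captures Alex's WAL/audit-log use case: the newest record always lands at the tail, and nothing already written can be touched.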
Some gitbuilders not working
We are seeing gitbuilder failures. This is what I saw on one: error: Failed build dependencies: xmlstarlet is needed by ceph-1:0.90-821.g680fe3c.el7.x86_64 David Zafman Senior Developer http://www.redhat.com
ceph-objectstore-tool and make check
The objectstore tool has been renamed from ceph_objectstore_tool to ceph-objectstore-tool. Please remove src/.libs/ceph_objectstore_tool and src/.libs/lt-ceph_objectstore_tool, or do a make clean with the latest master branch. Otherwise, a local make check can fail because the old binary of the tool will always be executed. David Zafman Senior Developer http://www.redhat.com
Re: Pull requests : speed up the reviews
I know I had a couple of pull requests that we weren't going to merge until after the giant release. This may have applied to some of the other ones too. In addition, it isn't surprising that with a new release some non-release code reviews would be neglected. That being said, this is a good time to remind people to dedicate time to code reviews. David Zafman Senior Developer http://www.inktank.com On Nov 9, 2014, at 4:08 AM, Joao Eduardo Luis j...@redhat.com wrote: On 11/08/2014 05:32 PM, Loic Dachary wrote: Hi Ceph, In the past few weeks the number of pending pull requests grew from around 20 to over 80. The good thing is that there are more contributions; the problem is that it requires more reviewers. Ceph is not the only project suffering from this kind of problem, and attending the OpenStack summit last week reminded me that the sooner it is addressed the better. After a few IRC discussions some ideas came up, and my favorite is that every developer paid full time to work on Ceph dedicates a daily 15 minute time slot, time boxed, to review pull requests. Timeboxing is kind of frustrating because some reviews require more. It basically means one has to focus on the pull request for ten minutes at most and take five minutes to write a useful comment that helps the author move forward. But it also is the only way to make room for a daily activity with no risk of postponing it because something more urgent came up. What do you think? On my calendar, I do have a time slot of one hour each morning to review pull requests and mailing lists, but I seldom honor it, especially when I'm caught up in other stuff. I'll move it over to lunch so that it has no chance of interfering with other tasks and try to make a habit of it. It would also be interesting to see more community involvement. I believe it would be healthy for the project if we could have (at least) a portion of reviews being performed by other people besides solely the paid developers/maintainers.
-Joao -- Joao Eduardo Luis Software Engineer | http://ceph.com
Re: Can pid be reused ?
I just realized what it is. The way killall is used when stopping a vstart cluster is to kill all processes by name! You can't stop vstarted tests running in parallel. David Zafman Senior Developer http://www.inktank.com On Oct 21, 2014, at 7:55 PM, Loic Dachary l...@dachary.org wrote: Hi, Something strange happens on fedora20 with linux 3.11.10-301.fc20.x86_64. Running make -j8 check on https://github.com/ceph/ceph/pull/2750 a process gets killed from time to time. For instance it shows as:

TEST_erasure_crush_stripe_width: 124: stripe_width=4096
TEST_erasure_crush_stripe_width: 125: ./ceph osd pool create pool_erasure 12 12 erasure
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
./test/mon/osd-pool-create.sh: line 120: 27557 Killed ./ceph osd pool create pool_erasure 12 12 erasure
TEST_erasure_crush_stripe_width: 126: ./ceph --format json osd dump
TEST_erasure_crush_stripe_width: 126: tee osd-pool-create/osd.json

in the test logs. Note the "27557 Killed". I originally thought it was because some ulimit was crossed, and set them to very generous / unlimited hard and soft thresholds:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515069
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 40
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Benoit Canet suggested that I install systemtap ( https://www.sourceware.org/systemtap/wiki/SystemtapOnFedora ) and run https://sourceware.org/systemtap/examples/process/sigkill.stp to watch what was sending the kill signal. It showed the following: ... SIGKILL was sent to ceph-osd (pid:27557) by vstart_wrapper.
uid:1001 SIGKILL was sent to python (pid:27557) by vstart_wrapper. uid:1001 This suggests that pid 27557, used by ceph-osd, was reused for the python script that was killed above. Because the script that kills daemons is very aggressive and kill -9s the pid to check whether it really is dead ( https://github.com/ceph/ceph/blob/giant/src/test/mon/mon-test-helpers.sh#L64 ), it explains the problem. However, as Dan Mick suggests, reusing pids quickly could break a number of things and it is a surprising behavior. Maybe something else is going on. A loop creating processes sees their pids increasing and not being reused. Any idea about what is going on would be much appreciated :-) Cheers -- Loïc Dachary, Artisan Logiciel Libre
Re: Can pid be reused ?
On Oct 22, 2014, at 3:43 PM, Sage Weil s...@newdream.net wrote: On Wed, 22 Oct 2014, David Zafman wrote: I just realized what it is. The way killall is used when stopping a vstart cluster is to kill all processes by name! You can't stop vstarted tests running in parallel. Ah. FWIW I think we should avoid using stop.sh whenever possible and instead do ./init-ceph stop (which does an orderly shutdown via pid files). sage Actually, vstart.sh can't create 2 independent clusters anyway, so it kills any existing processes. Probably vstart.sh is what would have killed the processes in a parallel make check. David
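The pidfile-based shutdown Sage prefers can still race with pid reuse; one common guard is to verify the process name before signaling. A Linux-specific sketch (the helper name is made up; it reads /proc, so it only applies on Linux):

```python
import os

def pid_is(pid, expected_comm):
    # Before killing a pid taken from a pidfile, confirm the process name
    # in /proc/<pid>/comm still matches what we expect -- guarding against
    # the pid having been recycled for an unrelated process.
    try:
        with open("/proc/%d/comm" % pid) as f:
            return f.read().strip() == expected_comm
    except OSError:
        return False  # process already exited (or no /proc available)
```

This check narrows, but does not fully close, the race: the pid could still be recycled between the check and the kill. Truly safe termination needs something like process groups or pidfd-style handles.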
Re: vstart.sh crashes MON with --paxos-propose-interval=0.01 and one MDS
I have this change in my branch so that test/ceph_objectstore_tool.py works again after that change from John. I wonder if this would fix your case too:

commit 18937cf49be616d32b4e2d0b6deef2882321fbe4
Author: David Zafman dzaf...@redhat.com
Date:   Tue Oct 14 18:45:41 2014 -0700

    vstart.sh: Disable mon pg warn min per osd to get healthy

    Signed-off-by: David Zafman dzaf...@redhat.com

diff --git a/src/vstart.sh b/src/vstart.sh
index febfa56..7a0ec1c 100755
--- a/src/vstart.sh
+++ b/src/vstart.sh
@@ -394,7 +394,7 @@
 $COSDDEBUG $COSDMEMSTORE $extra_conf
 [mon]
-mon pg warn min per osd = 10
+mon pg warn min per osd = 0
 mon osd allow primary affinity = true
 mon reweight min pgs per osd = 4
 $DAEMONOPTS

David Zafman Senior Developer http://www.inktank.com On Oct 16, 2014, at 3:52 PM, Loic Dachary l...@dachary.org wrote: Hi John, I would be grateful if you could take a quick look at http://tracker.ceph.com/issues/9794 . It is bisected to the reduction of pg and I'm able to reproduce it in a ubuntu-14.04 docker fresh install. For some reason it does not happen in gitbuilder, but I think you can reproduce it locally now. Cheers -- Loïc Dachary, Artisan Logiciel Libre
Re: make check failures
After updating my master branch, "make check" passes now. David Zafman Senior Developer http://www.inktank.com On Oct 7, 2014, at 11:28 PM, Loic Dachary l...@dachary.org wrote: [cc'ing the list in case someone else experiences problems with make check] Hi David, Yesterday you mentioned that make check is failing for you on master. Would you be so kind as to send the logs? Cheers -- Loïc Dachary, Artisan Logiciel Libre
wip-libcommon-rebase
Adam, Sage, The commit "osd: make coll_t::META static to each file" from wip-libcommon has been merged to master. I created a new branch with the other commits on the latest master branch called wip-libcommon-rebase. It required some conflict resolution in ceph.spec.in. No warranties are expressed or implied about the correctness or suitability of this branch for future use. David Zafman Senior Developer http://www.inktank.com http://www.redhat.com
Testing intermediate code for improved namespace handling
Check default namespace (none specified). This keeps the command output compatible with existing scripts:

./rados -p test ls
default-obj8
default-obj10
default-obj6
default-obj7
default-obj1
default-obj2
default-obj3
default-obj4
default-obj5
default-obj9

Try for all namespaces:

./rados -p test -N * ls
ns2 ns2-obj3
ns2 ns2-obj5
ns2 ns2-obj10
default-obj8
ns2 ns2-obj4
ns2 ns2-obj2
ns2 ns2-obj8
default-obj10
ns1 ns1-obj5
default-obj6
default-obj7
ns1 ns1-obj4
ns1 ns1-obj10
ns1 ns1-obj2
default-obj1
default-obj2
ns1 ns1-obj9
ns1 ns1-obj3
default-obj3
ns1 ns1-obj6
ns1 ns1-obj1
ns2 ns2-obj7
ns2 ns2-obj9
ns1 ns1-obj8
default-obj4
ns1 ns1-obj7
default-obj5
ns2 ns2-obj6
ns2 ns2-obj1
default-obj9

Try for only one specific namespace:

./rados -p test -N ns1 ls
ns1-obj5
ns1-obj4
ns1-obj10
ns1-obj2
ns1-obj9
ns1-obj3
ns1-obj6
ns1-obj1
ns1-obj8
ns1-obj7

David Zafman Senior Developer http://www.inktank.com http://www.redhat.com
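The output convention shown above can be captured in a small model: bare names for the default namespace, and a "namespace name" pair when listing all namespaces. This is a sketch of the formatting rule only, not the actual rados tool code; the tuple representation is an assumption.

```python
def format_ls(objects, namespace=None):
    """Render object listings the way the intermediate tool above appears
    to: `objects` is a list of (namespace, name) tuples, with "" meaning
    the default namespace.  namespace=None lists the default namespace,
    "*" lists everything with non-default names prefixed, and any other
    value lists that one namespace with bare names."""
    lines = []
    for ns, name in objects:
        if namespace == "*":
            # All namespaces: prefix non-default entries with the namespace.
            lines.append(name if ns == "" else f"{ns} {name}")
        elif ns == (namespace or ""):
            # Default (or one specific) namespace: bare names, scripts stay happy.
            lines.append(name)
    return lines
```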
Building a tool which links with librados
Has anyone seen anything like this from an application linked with librados using valgrind? Or a Segmentation fault on exit from such an application?

Invalid free() / delete / delete[] / realloc()
at 0x4C2A4BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
by 0x8195C12: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16)
by 0x13890F3: coll_t::~coll_t() (osd_types.h:468)
by 0x8944DEC: __cxa_finalize (cxa_finalize.c:56)
by 0x6E1CEC5: ??? (in /src/ceph/src/.libs/librados.so.2.0.0)
by 0x725F400: ??? (in /src/ceph/src/.libs/librados.so.2.0.0)
by 0x89449D0: __run_exit_handlers (exit.c:78)
by 0x8944A54: exit (exit.c:100)
by 0x137FF37: usage(boost::program_options::options_description&) (ceph_objectstore_tool.cc:1794)
by 0x1380572: main (ceph_objectstore_tool.cc:1849)

David Zafman Senior Developer http://www.inktank.com http://www.redhat.com
Re: Building a tool which links with librados
The import-rados feature (#8276) uses librados so in my wip-8231 branch I now link with librados. It is hard to reproduce, but I'll play with that commit and branch.

David Zafman Senior Developer http://www.inktank.com http://www.redhat.com

On Aug 21, 2014, at 4:56 PM, Sage Weil sw...@redhat.com wrote: On Thu, 21 Aug 2014, Gregory Farnum wrote: On Thu, Aug 21, 2014 at 4:37 PM, David Zafman david.zaf...@inktank.com wrote: Has anyone seen anything like this from an application linked with librados using valgrind? Or a Segmentation fault on exit from such an application?

Invalid free() / delete / delete[] / realloc()
at 0x4C2A4BC: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
by 0x8195C12: std::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.16)
by 0x13890F3: coll_t::~coll_t() (osd_types.h:468)
by 0x8944DEC: __cxa_finalize (cxa_finalize.c:56)
by 0x6E1CEC5: ??? (in /src/ceph/src/.libs/librados.so.2.0.0)
by 0x725F400: ??? (in /src/ceph/src/.libs/librados.so.2.0.0)
by 0x89449D0: __run_exit_handlers (exit.c:78)
by 0x8944A54: exit (exit.c:100)
by 0x137FF37: usage(boost::program_options::options_description&) (ceph_objectstore_tool.cc:1794)
by 0x1380572: main (ceph_objectstore_tool.cc:1849)

This looks fairly strange to me -- why does ceph_objectstore_tool do anything with librados? I thought it was just hitting the OSD filesystem structure directly. Also note that the crash appears to be underneath the coll_t destructor, probably in destroying its string. That combined with the weird librados presence makes me think memory corruption is running over the stack somewhere. Ah, this was fixed in 5d79605319fcde330bccce5e1b07276a98be02de in the wip-libcommon branch. The problem arises when we link libcommon statically (ceph-objectstore-tool) and dynamically (librados) at the same time. The easy fix here is not linking librados at all.
Not sure why we see this sometimes and not always... maybe link order? In any case, wip-libcommon moves libcommon.la into a .so shared between librados and the binary using it to avoid the problem. Makes things slightly more restrictive with mixed versions, but I suspect it is worth avoiding this sort of pain. Can you cherry-pick that commit and see if it resolves this for you? And/or merge in that entire branch? sage
Re: [RFC] add rocksdb support
Don't forget when a new submodule is added you need to initialize it. From the README:

Building Ceph
=============

To prepare the source tree after it has been git cloned,

$ git submodule update --init

To build the server daemons, and FUSE client, execute the following:

$ ./autogen.sh
$ ./configure
$ make

David Zafman Senior Developer http://www.inktank.com http://www.redhat.com

On Jun 13, 2014, at 11:51 AM, Sushma Gurram sushma.gur...@sandisk.com wrote: Hi Xinxin, I tried to compile the wip-rocksdb branch, but the src/rocksdb directory seems to be empty. Do I need to put autoconf/automake in this directory? It doesn't seem to have any other source files and compilation fails: os/RocksDBStore.cc:10:24: fatal error: rocksdb/db.h: No such file or directory compilation terminated. Thanks, Sushma -----Original Message----- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Shu, Xinxin Sent: Monday, June 09, 2014 10:00 PM To: Mark Nelson; Sage Weil Cc: ceph-devel@vger.kernel.org; Zhang, Jian Subject: RE: [RFC] add rocksdb support Hi Mark, I have finished development of support for the rocksdb submodule; a pull request for autoconf/automake support for rocksdb has been created, which you can find at https://github.com/ceph/rocksdb/pull/2 . If this patch is ok, I will create a pull request for rocksdb submodule support; currently this patch can be found at https://github.com/xinxinsh/ceph/tree/wip-rocksdb .
-----Original Message----- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson Sent: Tuesday, June 10, 2014 1:12 AM To: Shu, Xinxin; Sage Weil Cc: ceph-devel@vger.kernel.org; Zhang, Jian Subject: Re: [RFC] add rocksdb support Hi Xinxin, On 05/28/2014 05:05 AM, Shu, Xinxin wrote: Hi Sage, I will add two configure options: --with-librocksdb-static and --with-librocksdb. With the --with-librocksdb-static option, ceph will compile the rocksdb code that it gets from the ceph repository; with the --with-librocksdb option, in case of distro packages for rocksdb, ceph will not compile the rocksdb code and will use the pre-installed library. Is that ok for you? Since current rocksdb does not support autoconf/automake, I will add autoconf/automake support for rocksdb, but before that, I think we should fork a stable branch (maybe 3.0) for ceph. I'm looking at testing out the rocksdb support as well, both for the OSD and for the monitor based on some issues we've been seeing lately. Any news on the 3.0 fork and autoconf/automake support in rocksdb? Thanks, Mark -----Original Message----- From: Mark Nelson [mailto:mark.nel...@inktank.com] Sent: Wednesday, May 21, 2014 9:06 PM To: Shu, Xinxin; Sage Weil Cc: ceph-devel@vger.kernel.org; Zhang, Jian Subject: Re: [RFC] add rocksdb support On 05/21/2014 07:54 AM, Shu, Xinxin wrote: Hi Sage, I will add the rocksdb submodule into the makefile; currently we want to run full performance tests on the key-value db backends, both leveldb and rocksdb, then optimize rocksdb performance. I'm definitely interested in any performance tests you do here. Last winter I started doing some fairly high level tests on raw leveldb/hyperleveldb/raikleveldb. I'm very interested in what you see with rocksdb as a backend.
-----Original Message----- From: Sage Weil [mailto:s...@inktank.com] Sent: Wednesday, May 21, 2014 9:19 AM To: Shu, Xinxin Cc: ceph-devel@vger.kernel.org Subject: Re: [RFC] add rocksdb support Hi Xinxin, I've pushed an updated wip-rocksdb to github/liewegas/ceph.git that includes the latest set of patches with the groundwork and your rocksdb patch. There is also a commit that adds rocksdb as a git submodule. I'm thinking that, since there aren't any distro packages for rocksdb at this point, this is going to be the easiest way to make this usable for people. If you can wire the submodule into the makefile, we can merge this in so that rocksdb support is in the ceph.com packages. I suspect that the distros will prefer to turn this off in favor of separate shared libs, but they can do this at their option if/when they include rocksdb in the distro. I think the key is just to have both --with-librocksdb and --with-librocksdb-static (or similar) options so that you can either use the statically or dynamically linked one. Has your group done further testing with rocksdb? Anything interesting to share? Thanks! sage
mon_command
My understanding is that we are going to be using rados_mon_command() to create pools, according to Tracker #7586: deprecate rados_pool_create. What I found when building a test case for EC is that after using the mon_command to create the pool I need to use wait_for_latest_osdmap() in order to wait for the change to propagate. The replicated pool test case using pool_create() doesn't need to call wait_for_latest_osdmap(). Should we be deprecating librados calls in favor of the very generic mon_command() interface? I would suggest that we add the appropriate librados features to manipulate erasure coded pools. David Zafman Senior Developer http://www.inktank.com
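The mon_command path being discussed looks roughly like this from Python. The command JSON follows the "osd pool create" monitor command; the pool name and pg count are made-up examples, and the commented usage against a cluster handle is a sketch, not tested code:

```python
import json

def pool_create_cmd(pool_name, pg_num):
    """Build the monitor command JSON that would be handed to
    rados_mon_command() / Rados.mon_command() to create a pool.  The
    'prefix' and argument names mirror the 'ceph osd pool create' CLI;
    the values are caller-supplied examples."""
    return json.dumps({
        "prefix": "osd pool create",
        "pool": pool_name,
        "pg_num": pg_num,
    })

# Hypothetical usage against a connected python-rados cluster handle:
#   cluster.mon_command(pool_create_cmd("ecpool", 12), b'')
#   cluster.wait_for_latest_osdmap()  # let the new pool propagate before use
```

The extra wait is exactly the ergonomic cost David describes: the generic mon_command returns when the monitor commits the change, not when this client's osdmap reflects it.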
Re: 6685 backfill head/snapdir issue brain dump
Another way to look at this is to enumerate the recovery cases:

Primary starts with head and no snapdir:

A) Recovery sets last_backfill_started to head and sends the head object where needed
   head (1.b case while backfills in flight -> 1.a when done)
   snapdir (2)
B) Recovery sets last_backfill_started to snapdir and would send snapdir remove(s) and the same as the above case for head
   head (1.b case while backfills in flight -> 1.a when done)
   snapdir (1.a)

Primary starts with snapdir and no head:

C) Recovery sets last_backfill_started to head and sends a remove of head
   head (1.a)
   snapdir (2)
D) Recovery sets last_backfill_started to snapdir and sends both a remove of head and a create of snapdir
   head (1.a)
   snapdir (1.b case while backfills in flight -> 1.a when done)

Cases B and D meet our criteria because they include head/snapdir <= last_backfill_started and we check both head and snapdir for is_degraded_object(). Also, removes are always processed before creates even if recover_backfill() saw them in the other order (case B). That way, once the head objects are created (1.a) we know that all snapdirs have been removed too. In other words, these 2 cases do not allow intervening operations to occur that confuse the head/snapdir state. Case C is tricky. An intervening write to head requires update_range() determining that snapdir is gone, even though, had it not looked at the log, it was going to try to recover (re-create) snapdir. Case A is the only one which has a problem with an intervening deletion of the head object.
David

On Feb 20, 2014, at 12:07 PM, Samuel Just sam.j...@inktank.com wrote: The current implementation divides the hobject space into two sets:

1) { oid | oid <= last_backfill_started }
2) { oid | oid > last_backfill_started }

Space 1) is further divided into two sets:

1.a) { oid | oid \notin backfills_in_flight }
1.b) { oid | oid \in backfills_in_flight }

The value of this division is that we must send ops in set 1.a to the backfill peer because we won't re-backfill those objects and they must therefore be kept up to date. Furthermore, we *can* send the op because the backfill peer already has all of the dependencies (this statement is where we run into trouble). In set 2), we have not yet backfilled the object, so we are free to not send the op to the peer, confident that the object will be backfilled later. In set 1.b), we block operations until the backfill operation is complete. This is necessary at the very least because we are in the process of reading the object and shouldn't be sending writes anyway. Thus, it seems to me like we are blocking, in some sense, the minimum possible set of ops, which is good. The issue is that there is a small category of ops which violate our statement above that we can send ops in set 1.a: ops where the corresponding snapdir object is in set 2 or set 1.b. The 1.b case we currently handle by requiring that snapdir also be !is_degraded_object. The case where the snapdir falls into set 2 should be the problem, but now I am wondering. I think the original problem was as follows:

1) advance last_backfill_started to head
2) complete recovery on head
3) accept op on head which deletes head and creates snapdir
4) start op
5) attempt to recover snapdir
6) race with write and get screwed up

Now, however, we have logic to delay backfill on ObjectContexts which currently have write locks. It should suffice to take a write lock on the new snapdir and use that... which we do since the ECBackend patch series.
The case where we create head and remove snapdir isn't an issue since we'll just send the delete, which will work whether snapdir exists or not... We can also just include a delete in the snapdir creation transaction to make it correctly handle garbage snapdirs on backfill peers. The snapdir would then be superfluously recovered, but that's probably ok? The main issue I see is that it would cause the primary's idea of the replica's backfill_interval to be slightly incorrect (snapdir would have been removed or created on the peer, but not reflected in the master's current backfill_interval, which might contain snapdir). We could adjust it in make_writeable, or update_range? Sidenote: multiple backfill peers complicates the issue only slightly. All backfill peers with last_backfill >= last_backfill_started are handled uniformly as above. Any backfill peer with last_backfill < last_backfill_started we can model as having a private last_backfill_started equal to last_backfill. This results in a picture for that peer identical to the one above with an empty set 1.b. Because 1.b is empty for these peers, is_degraded_object can disregard them. should_send_op accounts for them with the MIN(last_backfill, last_backfill_started) adjustment. Anyone have anything
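The space division Sam describes can be reduced to a toy model, assuming objects compare as plain strings; the helper names echo the OSD's should_send_op / is_degraded_object but this is an illustration of the invariants, not the PG code:

```python
def should_send_op(oid, last_backfill_started):
    """Set 1 (oid <= last_backfill_started) has already been backfilled,
    so ops on it must be forwarded to keep the peer up to date; set 2
    (oid > last_backfill_started) will be backfilled later, so ops on it
    may safely be skipped."""
    return oid <= last_backfill_started

def is_degraded_object(oid, last_backfill_started, backfills_in_flight):
    """Set 1.b: the object is being copied right now, so writes must
    block until this object's backfill completes."""
    return oid <= last_backfill_started and oid in backfills_in_flight
```

In this model the multi-peer sidenote falls out naturally: a lagging peer is just this picture evaluated with min(last_backfill, last_backfill_started) and an empty backfills_in_flight.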
wip-libcephfs-emp-rb
I rebased wip-libcephfs and pushed as wip-libcephfs-emp-rb so that we can get this in to the Emperor release. Sage mentioned that he had hit a fuse problem in the wip-libcephfs branch, so apparently the problem is still present. Have you run into this bug in your testing? Are you testing with these modifications to Ceph?

2013-10-04T10:39:01.664 INFO:teuthology.task.workunit.client.0.out:[10.214.132.22]: CC kernel/softirq.o
2013-10-04T10:39:02.059 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: *** Caught signal (Segmentation fault) **
2013-10-04T10:39:02.059 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: in thread 7f57fa316780
2013-10-04T10:39:02.073 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: ceph version 0.69-500-g09f4df0 (09f4df02a866230b19539b03061f4abc5ab47ae2)
2013-10-04T10:39:02.073 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 1: ceph-fuse() [0x5e0d1a]
2013-10-04T10:39:02.073 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 2: (()+0xfcb0) [0x7f57f9cc5cb0]
2013-10-04T10:39:02.073 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 3: (Client::_get_inodeno(Inode*)+0) [0x52dd10]
2013-10-04T10:39:02.073 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 4: (Client::ll_forget(Inode*, int)+0x4a) [0x538d1a]
2013-10-04T10:39:02.074 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 5: ceph-fuse() [0x52a1c5]
2013-10-04T10:39:02.074 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 6: (fuse_session_loop()+0x75) [0x7f57f9ee3d65]
2013-10-04T10:39:02.074 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 7: (main()+0x84c) [0x5266fc]
2013-10-04T10:39:02.074 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 8: (__libc_start_main()+0xed) [0x7f57f83f576d]
2013-10-04T10:39:02.074 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 9: ceph-fuse() [0x527c99]
2013-10-04T10:39:02.074 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: 2013-10-04 10:39:02.072487 7f57fa316780 -1 *** Caught signal (Segmentation fault) **
2013-10-04T10:39:02.074 INFO:teuthology.task.ceph-fuse.ceph-fuse.0.err:[10.214.132.22]: in thread 7f57fa316780

David Zafman Senior Developer http://www.inktank.com
Re: xattr limits
Here is the test script: xattr-test.sh Description: Binary data

David Zafman Senior Developer http://www.inktank.com

On Oct 3, 2013, at 11:02 PM, Loic Dachary l...@dachary.org wrote: Hi David, Would you mind attaching the script to the mail for completeness? It's a useful thing to have :-) Cheers On 04/10/2013 01:21, David Zafman wrote: I want to record with the ceph-devel archive results from testing limits of xattrs for Linux filesystems used with Ceph. Script that creates xattrs with name user.test1, user.test2, ... on a single file. 3.10 linux kernel.

ext4:
value bytes  number of entries
1            148
16           103
256          14
512          7
1024         3
4036         1
Beyond this immediately get ENOSPC

btrfs:
value bytes  number of entries
8            10k
16           10k
32           10k
64           10k
128          10k
256          10k
512          10k (slow but worked; 1,000,000 got completely hung for minutes at a time during removal; strace showed no forward progress)
1024         10k
2048         10k
3096         10k
Beyond this you start getting ENOSPC after fewer entries

xfs (limit entries due to xfs crash with 10k entries):
value bytes  number of entries
1            1k
8            1k
16           1k
32           1k
64           1k
128          1k
256          1k
512          1k
1024         1k
2048         1k
4096         1k
8192         1k
16384        1k
32768        1k
65536        1k

-- Loïc Dachary, Artisan Logiciel Libre All that is necessary for the triumph of evil is that good people do nothing.
RESEND: xattr issue with 3.11 kernel
    setfattr --remove=$entry $FILENAME
    if [ $? != 0 ]; then
      echo failure to remove $entry
      break
    fi
    rmcount=`expr $rmcount + 1`
  done
  getfattr --dump $FILENAME
  rmdir $FILENAME
done
rm src.$$
exit 0

David Zafman Senior Developer http://www.inktank.com
Re: [ceph-users] v0.67.4 released
Unit tests on v0.67.4 not passing. Could be a test case needs to be fixed.

$ test/encoding/check-generated.sh
checking ceph-dencoder generated test instances...
numgen type
3 ACLGrant
...
4 ObjectStore::Transaction
mon/PGMap.cc: In function 'void PGMap::apply_incremental(CephContext*, const PGMap::Incremental&)' thread 7fac10e81780 time 2013-10-04 18:08:59.019448
mon/PGMap.cc: 226: FAILED assert(inc.get_osd_epochs().find(osd) != inc.get_osd_epochs().end())
ceph version 0.69-548-ge927941 (e927941fcadff56483137cffc0899b4ab9c6c297)
1: (PGMap::apply_incremental(CephContext*, PGMap::Incremental const&)+0x697) [0x948bc7]
2: (PGMap::generate_test_instances(std::list<PGMap*, std::allocator<PGMap*> >&)+0xc3) [0x949433]
3: (main()+0xce27) [0x5e48f7]
4: (__libc_start_main()+0xed) [0x7fac0ef3176d]
5: ./ceph-dencoder() [0x5eb749]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted

David Zafman Senior Developer http://www.inktank.com

On Oct 4, 2013, at 4:55 PM, Sage Weil s...@inktank.com wrote: This point release fixes an important performance issue with radosgw, keystone authentication token caching, and CORS. All users (especially those of rgw) are encouraged to upgrade. Notable changes:
* crush: fix invalidation of cached names
* crushtool: do not crash on non-unique bucket ids
* mds: be more careful when decoding LogEvents
* mds: fix heap check debugging commands
* mon: avoid rebuilding old full osdmaps
* mon: fix 'ceph crush move ...'
* mon: fix 'ceph osd crush reweight ...'
* mon: fix writeout of full osdmaps during trim
* mon: limit size of transactions
* mon: prevent both unmanaged and pool snaps
* osd: disable xattr size limit (prevents upload of large rgw objects)
* osd: fix recovery op throttling
* osd: fix throttling of log messages for very slow requests
* rgw: drain pending requests before completing write
* rgw: fix CORS
* rgw: fix inefficient list::size() usage
* rgw: fix keystone token expiration
* rgw: fix minor memory leaks
* rgw: fix null termination of buffer

For more detail:
* http://ceph.com/docs/master/release-notes/#v0-67-4-dumpling
* http://ceph.com/docs/master/_downloads/v0.67.4.txt

You can get v0.67.4 from the usual locations:
* Git at git://github.com/ceph/ceph.git
* Tarball at http://ceph.com/download/ceph-0.67.4.tar.gz
* For Debian/Ubuntu packages, see http://ceph.com/docs/master/install/debian
* For RPMs, see http://ceph.com/docs/master/install/rpm

___ ceph-users mailing list ceph-us...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
xattr limits
I want to record with the ceph-devel archive results from testing limits of xattrs for Linux filesystems used with Ceph. Script that creates xattrs with name user.test1, user.test2, ... on a single file. 3.10 linux kernel.

ext4:
value bytes  number of entries
1            148
16           103
256          14
512          7
1024         3
4036         1
Beyond this immediately get ENOSPC

btrfs:
value bytes  number of entries
8            10k
16           10k
32           10k
64           10k
128          10k
256          10k
512          10k (slow but worked; 1,000,000 got completely hung for minutes at a time during removal; strace showed no forward progress)
1024         10k
2048         10k
3096         10k
Beyond this you start getting ENOSPC after fewer entries

xfs (limit entries due to xfs crash with 10k entries):
value bytes  number of entries
1            1k
8            1k
16           1k
32           1k
64           1k
128          1k
256          1k
512          1k
1024         1k
2048         1k
4096         1k
8192         1k
16384        1k
32768        1k
65536        1k
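The add-until-ENOSPC probing the shell script performs can be sketched in Python (Linux-only, via os.setxattr); the file path and xattr names are examples following the user.testN convention above, and the error handling is a simplification of what the script does:

```python
import errno
import os

def probe_xattr_limit(path, value_size, max_entries=10000):
    """Attach user.test1, user.test2, ... xattrs of value_size bytes to
    `path` until the filesystem refuses with ENOSPC, and return how many
    fit.  Returns 0 immediately on filesystems without user xattr
    support (ENOTSUP), so results are only meaningful on ext4/btrfs/xfs
    and similar."""
    value = b"x" * value_size
    count = 0
    try:
        for i in range(1, max_entries + 1):
            os.setxattr(path, f"user.test{i}", value)
            count = i
    except OSError as e:
        if e.errno not in (errno.ENOSPC, errno.ENOTSUP):
            raise
    return count
```

Note the trade-off the numbers above expose: ext4 (pre-large-xattr) packs all xattrs into one block, so capacity falls sharply as value size grows, while btrfs and xfs spill to separate storage.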
Re: 4 failed, 298 passed in dzafman-2013-09-23_17:50:06-rados-wip-5862-testing-basic-plana
The osd.4 crash in 13443 is bug #5951:

2013-09-23 21:23:28.378428 1034e700 0 filestore(/var/lib/ceph/osd/ceph-4) error (17) File exists not handled on operation 20 (6579.0.0, or op 0, counting from 0)
2013-09-23 21:23:28.862204 1034e700 0 filestore(/var/lib/ceph/osd/ceph-4) unexpected error code
2013-09-23 21:23:28.864816 1034e700 0 filestore(/var/lib/ceph/osd/ceph-4) transaction dump:
{ "ops": [
  { "op_num": 0, "op_name": "mkcoll", "collection": "4.6_head"},
  { "op_num": 1, "op_name": "collection_setattr", "collection": "4.6_head", "name": "info", "length": 1},
  { "op_num": 2, "op_name": "omap_setkeys", "collection": "meta", "oid": "16ef7597\/infos\/head\/\/-1", "attr_lens": { "4.6_biginfo": 125, "4.6_epoch": 4, "4.6_info": 576}},
  { "op_num": 3, "op_name": "touch", "collection": "meta", "oid": "1039d44e\/pglog_4.6\/0\/\/-1"},
  { "op_num": 4, "op_name": "omap_rmkeys", "collection": "meta", "oid": "1039d44e\/pglog_4.6\/0\/\/-1"},
  { "op_num": 5, "op_name": "omap_setkeys", "collection": "meta", "oid": "1039d44e\/pglog_4.6\/0\/\/-1", "attr_lens": {}}]}
2013-09-23 21:23:28.959220 1a282700 5 osd.4 pg_epoch: 424 pg[32.0( empty local-les=424 n=0 ec=96 les/c 400/400 419/419/419) [4,1] r=0 lpr=419 pi=364-418/6 mlcod 0'0 active] enter Started/Primary/Active/Activating
2013-09-23 21:23:29.116768 1034e700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)' thread 1034e700 time 2013-09-23 21:23:28.920055
os/FileStore.cc: 2461: FAILED assert(0 == "unexpected error")
ceph version 0.69-220-g4f7526a (4f7526a785692795ee29f7101b8b18482b4c6e11)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int)+0xffc) [0x72473c]
2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x71) [0x72b241]
3: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x291) [0x72b4f1]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x93da36]
5: (ThreadPool::WorkThread::entry()+0x10) [0x93f840]
6: (()+0x7e9a) [0x503be9a]
7: (clone()+0x6d) [0x6c71ccd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

David Zafman Senior Developer http://www.inktank.com

On Sep 24, 2013, at 12:03 PM, Sage Weil s...@inktank.com wrote: On Tue, 24 Sep 2013, David Zafman wrote: Rados suite test run results for wip-5862. 2 scrub mismatch from mon (known problem). 2 are valgrind issues found with mds and osd. What is the osd valgrind failure? And the osd.4 crash on 13443? (Note that the teuthology.log will include messages about valgrind issues found in the mds log, but does not generate an actual error about it.) Thanks! sage David Zafman Senior Developer http://www.inktank.com Begin forwarded message: From: teuthwor...@teuthology.front.sepia.ceph.com Subject: 4 failed, 298 passed in dzafman-2013-09-23_17:50:06-rados-wip-5862-testing-basic-plana Date: September 23, 2013 10:48:00 PM PDT To: david.zaf...@inktank.com Test Run: dzafman-2013-09-23_17:50:06-rados-wip-5862-testing-basic-plana

logs: http://qa-proxy.ceph.com/teuthology/dzafman-2013-09-23_17:50:06-rados-wip-5862-testing-basic-plana/
failed: 4
hung: 0
passed: 298

Failed:
[13187] rados/monthrash/{ceph/ceph.yaml clusters/3-mons.yaml fs/xfs.yaml msgr-failures/mon-delay.yaml thrashers/force-sync-many.yaml workloads/snaps-few-objects.yaml} - time: 2095s log: http://qa-proxy.ceph.com/teuthology/dzafman-2013-09-23_17:50:06-rados-wip-5862-testing-basic-plana/13187/ 2013-09-23 18:22:37.805204 mon.0 10.214.132.24:6789/0 514 : [ERR] scrub mismatch in cluster log
[13449] rados/verify/{1thrash/none.yaml clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml tasks/rados_api_tests.yaml validater/valgrind.yaml} - time: 1067s log: http://qa-proxy.ceph.com/teuthology/dzafman-2013-09-23_17:50:06-rados-wip-5862-testing-basic-plana/13449/ saw valgrind issues
[13443] rados/verify/{1thrash/default.yaml clusters/fixed-2.yaml fs/btrfs.yaml msgr-failures/few.yaml tasks/rados_api_tests.yaml validater/valgrind.yaml} - time: 1307s log: http://qa-proxy.ceph.com/teuthology/dzafman-2013-09-23_17:50:06-rados-wip-5862-testing-basic-plana/13443/ timed out waiting for admin_socket to appear after osd.4 restart
[13227] rados/monthrash/{ceph/ceph.yaml
Re: [ceph-users] cuttlefish countdown -- OSD doesn't get marked out
The behavior you are seeing is exactly what would be expected if OSDs are not being marked out. The testing of my fix showed that if a portion of a rack's OSDs go down, they will be marked out after the configured amount of time (5 min by default). Once down OSDs are out, the remaining OSDs take responsibility for holding the data assigned to that rack. Though I didn't look at the data movement, I'm confident that it will work. You can simply mark your OSDs out manually to verify that missing replicas are replaced.

David Zafman Senior Developer http://www.inktank.com

On Apr 26, 2013, at 1:50 AM, Martin Mailand mar...@tuxadero.com wrote: Hi David, did you test it with more than one rack as well? In my first problem I used two racks, with a custom crushmap, so that the replicas are in the two racks (replication level = 2). Then I took one osd down, and expected that the remaining osds in this rack would get the now missing replicas from the osd of the other rack. But nothing happened, the cluster stayed degraded. -martin On 26.04.2013 02:22, David Zafman wrote: I filed tracker bug 4822 and have wip-4822 with a fix. My manual testing shows that it works. I'm building a teuthology test. Given your osd tree has a single rack, it should always mark OSDs out after 5 minutes by default. David Zafman Senior Developer http://www.inktank.com On Apr 25, 2013, at 9:38 AM, Martin Mailand mar...@tuxadero.com wrote: Hi Sage, On 25.04.2013 18:17, Sage Weil wrote: What is the output from 'ceph osd tree' and the contents of your [mon*] sections of ceph.conf? Thanks!
sage

root@store1:~# ceph osd tree
# id  weight  type name         up/down  reweight
-1    24      root default
-3    24        rack unknownrack
-2    4           host store1
0     1             osd.0       up       1
1     1             osd.1       down     1
2     1             osd.2       up       1
3     1             osd.3       up       1
-4    4           host store3
10    1             osd.10      up       1
11    1             osd.11      up       1
8     1             osd.8       up       1
9     1             osd.9       up       1
-5    4           host store4
12    1             osd.12      up       1
13    1             osd.13      up       1
14    1             osd.14      up       1
15    1             osd.15      up       1
-6    4           host store5
16    1             osd.16      up       1
17    1             osd.17      up       1
18    1             osd.18      up       1
19    1             osd.19      up       1
-7    4           host store6
20    1             osd.20      up       1
21    1             osd.21      up       1
22    1             osd.22      up       1
23    1             osd.23      up       1
-8    4           host store2
4     1             osd.4       up       1
5     1             osd.5       up       1
6     1             osd.6       up       1
7     1             osd.7       up       1

[global]
auth cluster requierd = none
auth service required = none
auth client required = none
# log file =
log_max_recent=100
log_max_new=100
[mon]
mon data = /data/mon.$id
[mon.a]
mon host = store1
mon addr = 192.168.195.31:6789
[mon.b]
mon host = store3
mon addr = 192.168.195.33:6789
[mon.c]
mon host = store5
mon addr = 192.168.195.35:6789
Re: [ceph-users] cuttlefish countdown -- OSD doesn't get marked out
Mike / Martin,

The OSD down behavior Mike is seeing is different. You should be seeing messages like this in your leader's monitor log:

can_mark_down current up_ratio 0.17 min 0.3, will not mark osd.2 down

To dampen certain kinds of cascading failures, we deliberately stop automatically marking OSDs down once fewer than 30% of them remain up.

As far as Martin is concerned, his osd tree shows a single rack, but he said that his crush rules are supposed to put a replica on each of 2 racks. I don't remember seeing his crush rules in any of the e-mails, but even so he only has unknownrack with id -3 defined.

David Zafman
Senior Developer
http://www.inktank.com

On Apr 26, 2013, at 6:44 AM, Mike Dawson mike.daw...@scholarstack.com wrote:

David / Martin,

I can confirm this issue. At present I am running monitors only, with 100% of my OSD processes shut down. For the past couple of hours, Ceph has reported:

osdmap e1323: 66 osds: 19 up, 66 in

I can mark them down manually using `ceph osd down 0` as expected, but they never get marked down automatically. Like Martin, I also have a custom crushmap, but this cluster is operating with a single rack. I'll be happy to provide any documentation / configs / logs you would like. I am currently running ceph version 0.60-666-ga5cade1 (a5cade1fe7338602fb2bbfa867433d825f337c87) from gitbuilder.

- Mike

On 4/26/2013 4:50 AM, Martin Mailand wrote:

Hi David, did you test it with more than one rack as well? In my first problem I used two racks with a custom crushmap, so that the replicas are spread across the two racks (replication level = 2). Then I took one OSD down and expected that the remaining OSDs in this rack would get the now-missing replicas from the OSDs of the other rack. But nothing happened; the cluster stayed degraded.

-martin

On 26.04.2013 02:22, David Zafman wrote:

I filed tracker bug 4822 and have wip-4822 with a fix. My manual testing shows that it works. I'm building a teuthology test.
Given that your osd tree has a single rack, it should always mark OSDs out after 5 minutes by default.

David Zafman
Senior Developer
http://www.inktank.com

On Apr 25, 2013, at 9:38 AM, Martin Mailand mar...@tuxadero.com wrote:

Hi Sage,

On 25.04.2013 18:17, Sage Weil wrote: What is the output from 'ceph osd tree' and the contents of your [mon*] sections of ceph.conf? Thanks!

sage

[...]
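The can_mark_down guard behind the log line quoted in this thread ("current up_ratio 0.17 min 0.3, will not mark osd.2 down") can be sketched as follows. This is a simplified model of the check, not the monitor's source; the function name and the default constant are assumptions for the sketch:

```python
# Sketch of the cascading-failure damper: once the fraction of
# OSDs still up would fall below the minimum ratio, the monitor
# refuses to mark further OSDs down automatically.
MIN_UP_RATIO = 0.3  # the 30% default mentioned in the thread

def can_mark_down(num_up, num_total, min_up_ratio=MIN_UP_RATIO):
    """True if the monitor may still auto-mark an OSD down."""
    up_ratio = num_up / num_total
    return up_ratio >= min_up_ratio

# Mike's cluster: 19 of 66 OSDs up -> ratio ~0.29, below 0.3,
# so the rest never get marked down automatically.
print(can_mark_down(19, 66))  # -> False
print(can_mark_down(40, 66))  # -> True
```

This matches what Mike observed: with all OSD processes stopped, the osdmap froze at "19 up, 66 in" rather than marking everything down.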
Re: [ceph-users] cuttlefish countdown -- OSD doesn't get marked out
I filed tracker bug 4822 and have wip-4822 with a fix. My manual testing shows that it works. I'm building a teuthology test. Given that your osd tree has a single rack, it should always mark OSDs out after 5 minutes by default.

David Zafman
Senior Developer
http://www.inktank.com

On Apr 25, 2013, at 9:38 AM, Martin Mailand mar...@tuxadero.com wrote:

Hi Sage,

On 25.04.2013 18:17, Sage Weil wrote: What is the output from 'ceph osd tree' and the contents of your [mon*] sections of ceph.conf? Thanks!

sage

[...]
.gitignore issues
After updating to latest master I have the following files listed by git status:

$ git status
# On branch master
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       src/bench_log
#       src/ceph-filestore-dump
#       src/ceph.conf
#       src/dupstore
#       src/keyring
#       src/kvstorebench
#       src/multi_stress_watch
#       src/omapbench
#       src/psim
#       src/radosacl
#       src/scratchtool
#       src/scratchtoolpp
#       src/smalliobench
#       src/smalliobenchdumb
#       src/smalliobenchfs
#       src/smalliobenchrbd
#       src/streamtest
#       src/testcrypto
#       src/testkeys
#       src/testrados
#       src/testrados_delete_pools_parallel
#       src/testrados_list_parallel
#       src/testrados_open_pools_parallel
#       src/testrados_watch_notify
#       src/testsignal_handlers
#       src/testtimers
#       src/tpbench
#       src/xattr_bench
nothing added to commit but untracked files present (use "git add" to track)

David Zafman
Senior Developer
david.zaf...@inktank.com
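Assuming none of these generated files are meant to be tracked, a sketch of the .gitignore entries (placed in src/.gitignore, paths relative to src/) that would cover the report above:

```gitignore
# Files reported untracked by the git status output above (sketch)
/bench_log
/ceph-filestore-dump
/ceph.conf
/dupstore
/keyring
/kvstorebench
/multi_stress_watch
/omapbench
/psim
/radosacl
/scratchtool
/scratchtoolpp
/smalliobench
/smalliobenchdumb
/smalliobenchfs
/smalliobenchrbd
/streamtest
/testcrypto
/testkeys
/testrados
/testrados_delete_pools_parallel
/testrados_list_parallel
/testrados_open_pools_parallel
/testrados_watch_notify
/testsignal_handlers
/testtimers
/tpbench
/xattr_bench
```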
Re: [PATCH 0/2] two small patches for CEPH wireshark plugin
You could look at the wip-wireshark-zafman branch. I rebased it and force-pushed it. It has changes to wireshark.patch and a minor change I needed to get it to build. I'm surprised the recent check-in didn't include the change to packet-ceph.c which I needed to get it to build.

David Zafman
Senior Developer
david.zaf...@inktank.com

On Jan 24, 2013, at 12:49 PM, Danny Al-Gaaf danny.al-g...@bisect.de wrote:

Am 24.01.2013 19:31, schrieb Sage Weil:

Hi Danny! [...] Since you brought up wireshark... We would LOVE LOVE LOVE it if this plugin could get upstreamed into wireshark.

Yes, this would be great.

IIRC, the problem (last time we checked, ages ago) was that there were strict coding guidelines for that project that weren't followed. I'm not sure if that is still the case, or even if that is accurate. It would be great if someone on this list who is looking for a way to contribute could take the lead on trying to make this happen... :-)

I'll take a look at it maybe ... if I find some free time for it. What about the patches? Can we apply them to the ceph git tree until we have another solution for the wireshark code?

Danny
master branch issue in ceph.git
The latest code is hanging trying to start teuthology. I used teuthology-nuke to clear old state and reboot the machines. I was using my branch rebased to latest master, and when that started failing I switched to the default config. It still keeps hanging here:

INFO:teuthology.task.ceph:Waiting until ceph is healthy...

$ ceph -s
   health HEALTH_WARN 5 pgs degraded; 108 pgs stuck unclean
   monmap e1: 3 mons at {0=10.214.131.23:6789/0,1=10.214.131.21:6789/0,2=10.214.131.20:6789/0}, election epoch 6, quorum 0,1,2 0,1,2
   osdmap e7: 9 osds: 9 up, 9 in
   pgmap v25: 108 pgs: 103 active+remapped, 5 active+degraded; 0 bytes data, 798 GB used, 3050 GB / 4055 GB avail
   mdsmap e2: 0/0/0 up

David Zafman
Senior Developer
david.zaf...@inktank.com
Fwd: Interfaces proposed changes
I sent this proposal out to the developers that own the FSAL CEPH portion of NFS-Ganesha. They have changes to Ceph that expose additional interfaces for this. This is our initial cut at improving the interfaces.

David Zafman
Senior Developer
david.zaf...@inktank.com

Begin forwarded message:

From: David Zafman david.zaf...@inktank.com
Subject: Interfaces proposed changes
Date: January 4, 2013 5:50:49 PM PST
To: Matthew W. Benjamin m...@linuxbox.com, Adam C. Emerson aemer...@linuxbox.com

Below is a patch that shows the newly proposed low-level interface. Obviously, the ceph_ll_* functions you created in libcephfs.cc will have the corresponding changes made to them. An Fh * is used as an open file descriptor and needs a corresponding ll_release()/ceph_ll_close(). An Inode * returned by the various inode-create functions and ll_lookup_ino() is a referenced inode and needs a corresponding _ll_put() exposed via something maybe named ceph_ll_put(). The existing FSAL CEPH doesn't ever call ceph_ll_forget() even though there are references taken on inodes at the ceph ll_* operation level. This interface creates a clearer model to be used by FSAL CEPH. As I don't understand Ganesha's inode caching model, it isn't clear to me if it can indirectly hold inodes that are below FSAL. Especially for NFS v3, where there is no open state, the code shouldn't keep doing a final release of an inode after every operation.
diff --git a/src/client/Client.cc b/src/client/Client.cc
index d876454..4d4d0f1 100644
--- a/src/client/Client.cc
+++ b/src/client/Client.cc
@@ -6250,13 +6250,39 @@ bool Client::ll_forget(vinodeno_t vino, int num)
   return last;
 }
 
+inodeno_t Client::ll_get_ino(Inode *in)
+{
+  return in->ino;
+}
+
+snapid_t Client::ll_get_snapid(Inode *in)
+{
+  return in->snapid;
+}
+
+vinodeno_t Client::ll_get_vino(Inode *in)
+{
+  return vinodeno_t(in->ino, in->snapid);
+}
+
+Inode *Client::ll_lookup_ino(vinodeno_t vino)
+{
+  Mutex::Locker lock(client_lock);
+  hash_map<vinodeno_t,Inode*>::iterator p = inode_map.find(vino);
+  if (p == inode_map.end())
+    return NULL;
+  Inode *in = p->second;
+  _ll_get(in);
+  return in;
+}
+
 Inode *Client::_ll_get_inode(vinodeno_t vino)
 {
   assert(inode_map.count(vino));
   return inode_map[vino];
 }
 
-
 int Client::ll_getattr(vinodeno_t vino, struct stat *attr, int uid, int gid)
 {
   Mutex::Locker lock(client_lock);
@@ -7219,7 +7245,7 @@ int Client::ll_release(Fh *fh)
   return 0;
 }
 
-
+// --
diff --git a/src/client/Client.h b/src/client/Client.h
index 9512a2d..0cfe8d9 100644
--- a/src/client/Client.h
+++ b/src/client/Client.h
@@ -706,6 +706,32 @@ public:
   void ll_register_ino_invalidate_cb(client_ino_callback_t cb, void *handle);
   void ll_register_getgroups_cb(client_getgroups_callback_t cb, void *handle);
+
+  // low-level interface v2
+  inodeno_t ll_get_ino(Inode *in);
+  snapid_t ll_get_snapid(Inode *in);
+  vinodeno_t ll_get_vino(Inode *in);
+  Inode *ll_lookup_ino(vinodeno_t vino);
+  int ll_lookup(Inode *parent, const char *name, struct stat *attr, Inode **out, int uid = -1, int gid = -1);
+  bool ll_forget(Inode *in, int count);
+  int ll_getattr(Inode *in, struct stat *st, int uid = -1, int gid = -1);
+  int ll_setattr(Inode *in, struct stat *st, int mask, int uid = -1, int gid = -1);
+  int ll_getxattr(Inode *in, const char *name, void *value, size_t size, int uid=-1, int gid=-1);
+  int ll_setxattr(Inode *in, const char *name, const void *value, size_t size, int flags, int uid=-1, int gid=-1);
+  int ll_removexattr(Inode *in, const char *name, int uid=-1, int gid=-1);
+  int ll_listxattr(Inode *in, char *list, size_t size, int uid=-1, int gid=-1);
+  int ll_opendir(Inode *in, void **dirpp, int uid = -1, int gid = -1);
+  int ll_readlink(Inode *in, const char **value, int uid = -1, int gid = -1);
+  int ll_mknod(Inode *in, const char *name, mode_t mode, dev_t rdev, struct stat *attr, Inode **out, int uid = -1, int gid = -1);
+  int ll_mkdir(Inode *in, const char *name, mode_t mode, struct stat *attr, Inode **out, int uid = -1, int gid = -1);
+  int ll_symlink(Inode *in, const char *name, const char *value, struct stat *attr, Inode **out, int uid = -1, int gid = -1);
+  int ll_unlink(Inode *in, const char *name, int uid = -1, int gid = -1);
+  int ll_rmdir(Inode *in, const char *name, int uid = -1, int gid = -1);
+  int ll_rename(Inode *parent, const char *name, Inode *newparent, const char *newname, int uid = -1, int gid = -1);
+  int ll_link(Inode *in, Inode *newparent, const char *newname, struct stat *attr, int uid = -1, int gid = -1);
+  int ll_open(Inode *in, int flags, Fh **fh, int uid = -1, int gid = -1);
+  int ll_create(Inode *parent, const char *name, mode_t mode, int flags, struct stat *attr, Inode **out, int uid = -1, int gid = -1);
+  int ll_statfs(Inode *in, struct statvfs *stbuf
Re: [PATCH REPOST 0/4] rbd: four minor patches
I reviewed these.

Reviewed-by: David Zafman david.zaf...@inktank.com

David Zafman
Senior Developer
david.zaf...@inktank.com

On Jan 3, 2013, at 11:04 AM, Alex Elder el...@inktank.com wrote:

I'm re-posting my patch backlog, in chunks that may or may not match how they got posted before. This series contains some fairly straightforward changes. -Alex

[PATCH REPOST 1/4] rbd: document rbd_spec structure
[PATCH REPOST 2/4] rbd: kill rbd_spec->image_name_len
[PATCH REPOST 3/4] rbd: kill rbd_spec->image_id_len
[PATCH REPOST 4/4] rbd: use kmemdup()
testing branch of ceph-client repo was force pushed
I amended the last 5 commits which I committed to the testing branch last night. Please update your repositories accordingly.

David
Re: 0.55 init script Issue?
Keep in mind that some of the init.d stuff doesn't work with a ceph-deploy installed system. Not clear to me if we need to fix ceph-deploy, or whether for those types of setups only upstart should be used/available.

David

On Dec 5, 2012, at 11:41 AM, Dan Mick dan.m...@inktank.com wrote:

The story as best I know it is that we're trying to transition to and use upstart where possible, but that the upstart config does not (yet?) try to do what the init.d config did. That is, it doesn't support options to the one script, but rather separates daemons into separate services, and does not reach out to remote machines to start daemons, etc. The intent is that init.d/ceph is left for non-Upstart distros, AFAICT. Tv had some design notes here: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg09314.html We need better documentation/rationale here at least.

On 12/05/2012 08:15 AM, Mike Dawson wrote:

All,

After upgrading from 0.54 to 0.55, the command `service ceph start` fails, but `/etc/init.d/ceph start` works. This is the case for start, stop, etc. Here is an example:

root@node2:~# /etc/init.d/ceph stop
=== mon.a ===
Stopping Ceph mon.a on node2...kill 2505...done
=== osd.0 ===
Stopping Ceph osd.0 on node2...kill 5042...done
=== osd.1 ===
Stopping Ceph osd.1 on node2...kill 5116...done
=== osd.17 ===
Stopping Ceph osd.17 on node2...kill 5275...done
root@node2:~# service ceph start
start: Job is already running: ceph
root@node2:~# /etc/init.d/ceph start
=== mon.a ===
Starting Ceph mon.a on node2...
starting mon.a rank 0 at 172.16.1.2:6789/0 mon_data /var/lib/ceph/mon/ceph-a fsid 4951e786-945e-47b6-b1b1-4043b6cc3b55
=== osd.0 ===
Starting Ceph osd.0 on node2...
starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /dev/sda6
=== osd.1 ===
Starting Ceph osd.1 on node2...
starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /dev/sda7
=== osd.17 ===
Starting Ceph osd.17 on node2...
starting osd.17 at :/0 osd_data /var/lib/ceph/osd/ceph-17 /dev/sda8

This is Ubuntu 12.10 with packages from debian-testing. One other user on IRC confirmed the same behavior. Is this a known issue?

Thanks,
Mike Dawson
Re: Hadoop and Ceph client/mds view of modification time
On Nov 27, 2012, at 9:03 AM, Sage Weil s...@inktank.com wrote:

On Tue, 27 Nov 2012, Sam Lang wrote:

3. When a client acquires the cap for a file, have the mds provide its current time as well. As the client updates the mtime, it uses the timestamp provided by the mds and the time since the cap was acquired. Except for the skew caused by the message latency, this approach allows the mtime to be based off the mds time, so it will be consistent across clients and the mds. It does, however, allow a client to set an mtime in the future (based off of its local time), which might be undesirable, but that is more like how NFS behaves. Message latency probably won't be much of an issue either, as the granularity of mtime is a second. Also, the client can set its cap-acquired timestamp to the time at which the cap was requested, ensuring that the relative increment includes the round-trip latency so that the mtime will always be set further ahead. Of course, this approach would be a lot more intrusive to implement. :-)

Yeah, I'm less excited about this one. I think that giving consistent behavior from a single client despite clock skew is a good goal. That will make things like pjd's test behave consistently, for example.

My suggestion is that a client writing to a file will try to use its local clock unless it would cause the mtime to go backward. In that case it will simply perform the minimum mtime advance possible (1 second?). This handles the case in which one client created a file using his clock (per the previously suggested change), then another client writes with a clock that is behind.

David
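David's rule above (use the local clock unless it would move mtime backward; otherwise make the minimum possible advance) can be sketched as follows. The helper name and the 1-second granularity are illustrative assumptions, not actual Ceph client code:

```python
def next_mtime(current_mtime, local_clock, min_advance=1):
    """Pick the mtime for a write: use the client's local clock,
    but never let mtime move backward; if the local clock is
    behind, advance by the minimum granularity instead."""
    if local_clock > current_mtime:
        return local_clock
    return current_mtime + min_advance

print(next_mtime(100, 150))  # -> 150  (local clock ahead: use it)
print(next_mtime(100, 90))   # -> 101  (clock behind: minimal advance)
```

Sam's objection later in the thread is about the fallback branch: with one-second granularity, many rapid writes from a lagging client would each take the `+ min_advance` path and push mtime well into the future.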
Re: Hadoop and Ceph client/mds view of modification time
On Nov 27, 2012, at 11:05 AM, Sam Lang sam.l...@inktank.com wrote:

On 11/27/2012 12:01 PM, Sage Weil wrote:

On Tue, 27 Nov 2012, David Zafman wrote:

[...]

My suggestion is that a client writing to a file will try to use its local clock unless it would cause the mtime to go backward. In that case it will simply perform the minimum mtime advance possible (1 second?). This handles the case in which one client created a file using his clock (per the previously suggested change), then another client writes with a clock that is behind.

We can choose to not decrement at the client, but because mtime is a time_t (seconds since epoch), we can't increment by 1 for each write. 1000 writes each taking 0.01s would move the mtime 990 seconds into the future.
The mtime update shouldn't work that way (see below).

That's a possibility (if it's 1ms or 1ns, at least :). We need to verify what POSIX says about that, though: if you utimes(2) an mtime into the future, what happens on write(2)?

On ext4, after the mtime has been set into the future with utimes(2), a subsequent write(2) does move the time backward. However, we can notice that if ctime == mtime then only create/write/truncate has last been done to the file. This means that we should not let the mtime go backward in that case. If ctime != mtime, then the mtime has been set by utimes(2), so we can set mtime using our clock even if it goes backwards.

According to http://pubs.opengroup.org/onlinepubs/009695399/, writes only require an update to mtime; it doesn't specify what the update should be:

"Upon successful completion, where nbyte is greater than 0, write() shall mark for update the st_ctime and st_mtime fields of the file, and if the file is a regular file, the S_ISUID and S_ISGID bits of the file mode may be cleared."

What this really means is that all writes mark mtime for update without setting a specific time in the inode yet. All writes/truncates will be rolled into a single mtime bump. So even if we only have 1-second granularity (but hopefully it is 1 ms or 1 us), only when a stat occurs (or in our case when sending info to the MDS or returning capabilities) does a new mtime need to be set, and it will be at most 1 second ahead.

In NFS, the server sets the mtime. It's relatively common to see "Warning: file 'foo' has modification time in the future" if you're compiling on NFS and your client and NFS server clocks are skewed. So allowing the mtime to be set in the near future would at least follow the principle of least surprise for most folks.

So Ceph can see this warning too if different skewed clocks are setting mtime and it appears in the future to some clients.
-sam

sage
Re: Hadoop and Ceph client/mds view of modification time
On Nov 27, 2012, at 1:14 PM, Sam Lang sam.l...@inktank.com wrote:

On 11/27/2012 01:38 PM, David Zafman wrote:

[...]

However, we can notice that if ctime == mtime then only create/write/truncate has last been done to the file. This means that we should not let the mtime go backward in that case. If ctime != mtime, then the mtime has been set by utimes(2), so we can set mtime using our clock even if it goes backwards.

I'm not sure I follow you here. utimes(2) can set mtime and ctime to the same time or to different times, or set mtime and/or ctime to the current time. That makes it hard to rely on the mtime != ctime conditional.

utimes(2) does not allow you to modify ctime. As a matter of fact, if you set mtime, ctime will always be set to the local time. On a single system with only a forward-moving clock, ctime can never go backwards, nor will it ever look like it is in the future. Also, when setting mtime to now, it is the case that ctime == mtime. Ceph should ensure that ctime only moves forward. Unfortunately, it can't do that and also prevent ctime from looking like it is in the future, but NFS doesn't either, because the ctime is always set by the NFS server clock.
ubuntu@client:~$ date ; touch file ; stat file ; sleep 60 ; date ; echo foo >> file ; stat file ; sleep 15 ; date ; touch -m -t 11281200 file ; stat file ; sleep 15 ; date ; touch -m file ; stat file

### CREATE (atime == mtime == ctime)
Tue Nov 27 13:44:23 PST 2012
  File: `file'
  Size: 4         Blocks: 8          IO Block: 4096   regular file
Device: 805h/2053d      Inode: 145126      Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  ubuntu)   Gid: ( 1000/  ubuntu)
Access: 2012-11-27 13:44:23.685986005 -0800
Modify: 2012-11-27 13:44:23.685986005 -0800
Change: 2012-11-27 13:44:23.685986005 -0800
 Birth: -

### WRITE (mtime == ctime) advanced
Tue Nov 27 13:45:23 PST 2012
  File: `file'
  Size: 8         Blocks: 8          IO Block: 4096   regular file
Device: 805h/2053d      Inode: 145126      Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  ubuntu)   Gid: ( 1000/  ubuntu)
Access: 2012-11-27 13:44:23.685986005 -0800
Modify: 2012-11-27 13:45:23.701986456 -0800
Change: 2012-11-27 13:45:23.701986456 -0800
 Birth: -

### UTIMES(2) mtime in the future, ctime set to local clock
Tue Nov 27 13:45:38 PST 2012
  File: `file'
  Size: 8         Blocks: 8          IO Block: 4096   regular file
Device: 805h/2053d      Inode: 145126      Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/  ubuntu)   Gid: ( 1000/  ubuntu)
Access: 2012-11-27 13:44:23.685986005 -0800
Modify: 2012-11-28 12:00:00.0
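The ctime == mtime heuristic from this exchange can be sketched as a small predicate. This is a hypothetical helper illustrating the proposed rule, not Ceph code: a backward mtime step from the local clock is only acceptable when the current mtime was evidently set via utimes(2):

```python
def may_set_mtime_backward(ctime, mtime):
    """Proposed heuristic: if ctime == mtime, the file was last
    touched by create/write/truncate, so mtime must stay monotonic;
    if they differ, mtime was set via utimes(2) and a backward step
    from the local clock is acceptable."""
    return ctime != mtime

print(may_set_mtime_backward(100, 100))  # -> False (write path: keep monotonic)
print(may_set_mtime_backward(100, 500))  # -> True  (utimes(2) set a future mtime)
```

As Sam notes, this relies on utimes(2) never touching ctime directly, which is David's point about ctime always following the local clock.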
Re: getting kernel debug output
I also added a kcon_most teuthology task which does almost the same thing as ceph/src/script/kcon_most.sh to all or any set of clients. The teuthology version does not raise the console log level. For example:

tasks:
- ceph:
- kclient:
- kcon_most:
- interactive:

On Oct 24, 2012, at 11:14 AM, Alex Elder el...@inktank.com wrote:

On 10/24/2012 12:11 PM, Sage Weil wrote:

I'm working on http://tracker.newdream.net/issues/3342 and was able to reproduce the msgr bug (some annoying msgr race, I think) while generating full libceph debug output. I used a teuthology yaml fragment like so:

I have more trouble than that, but perhaps there's something weird about having my serial console connected from 1500 miles away. I'm impressed full debugging didn't mess things up.

tasks:
- clock: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds: null
- kclient: null
- exec:
    client.0:
    - echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control

This is cool, I didn't know you could do this.

- workunit:
    clients:
      all:
      - suites/ffsb.sh

I was pleasantly surprised that even though this is putting copious amounts of crap in dmesg, it didn't slow things down enough to avoid tripping the bug. And the 'dmesg' command in kdb appears to be working now (a couple months back it wasn't). Yay!

For me, dmesg has been working, but I'd like to know how to truncate the output to just, say, the last 200 lines. (Maybe there is one.) Anyway, this might be useful in tracking down other bugs as well...

Yes, this is good news.

-Alex