cephfs (hammer) flips directory access bits
Hi,

we are using cephfs on a ceph cluster (v0.94.5, 3x MON, 1x MDS, ~50x OSD). Recently, we observed a spontaneous (and unwanted) change in the access rights of newly created directories:

  $ umask 0077
  $ mkdir test
  $ ls -ld test
  drwx------ 1 me me 0 Jan  6 14:59 test
  $ touch test/foo
  $ ls -ld test
  drwxrwxrwx 1 me me 0 Jan  6 14:59 test
  $

I kindly would like to ask for help tracking down this issue.

ciao
Christian
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
The OSD process locked itself when I tested cephfs through filebench
Hi all,

When I tested randomrw on my cluster through filebench (running ceph 0.94.5), one of the OSDs was marked down, but I could still see the process with the ps command. So I checked the log file and found the following messages:

  2016-01-07 02:41:02.104124 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.5 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
  2016-01-07 02:41:02.104156 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.6 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
  2016-01-07 02:41:02.104168 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.7 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
  2016-01-07 02:41:02.104182 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.8 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
  2016-01-07 02:41:02.104194 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.12 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
  2016-01-07 02:41:02.104208 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.15 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
  2016-01-07 02:41:02.104226 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.16 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
  2016-01-07 02:41:02.104253 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.17 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:55.104035)
  2016-01-07 02:41:03.104394 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.3 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
  2016-01-07 02:41:03.104441 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.4 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
  2016-01-07 02:41:03.104451 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.5 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
  2016-01-07 02:41:03.104459 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.6 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
  2016-01-07 02:41:03.104467 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.7 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
  2016-01-07 02:41:03.104495 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.8 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
  2016-01-07 02:41:03.104503 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.12 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
  2016-01-07 02:41:03.104512 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.15 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
  2016-01-07 02:41:03.104526 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.16 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
  2016-01-07 02:41:03.104541 7fa9ae4cb700 -1 osd.11 1672 heartbeat_check: no reply from osd.17 since back 2016-01-07 02:40:49.365340 front 2016-01-07 02:40:49.365340 (cutoff 2016-01-07 02:40:56.104394)
  2016-01-07 02:56:17.340268 7fa98b99d700 0 -- 10.0.19.68:6816/10289 submit_message osd_op_reply(201270 105e069.046e [write 0~4194304] v1679'6462 uv6462 ondisk = 0) v6 remote, 10.0.3.68:0/49739, failed lossy con, dropping message 0x30f84fc0
  2016-01-07 02:56:17.886032 7fa9ae4cb700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 9.802397 secs
  2016-01-07 02:56:17.886195 7fa9ae4cb700 0 log_channel(cluster) log [WRN] : slow request 9.802397 seconds old, received at 2016-01-07 02:56:08.083416: osd_op(client.501311.0:201273 105e069.0471 [write 0~4194304] 7.ea64f958 RETRY=1 snapc 1=[] ondisk+retry+write+known_if_redirected e1679) currently waiting for subops from 3,6
  2016-01-07 02:56:18.886521 7fa9ae4cb700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 10.802942 secs
  2016-01-07 02:56:18.886626 7fa9ae4cb700 0 log_channel(cluster) log [WRN] : slow request 10.802942 seconds old, received at 2016-01-07 02:56:08.083416: osd_op(client.501311.0:201273
Custom STL allocator
I'd like your opinion, guys, on two features implemented in an attempt to greatly reduce the number of memory allocations without major surgery in the code. The features are:

1. A custom STL allocator, which allocates the first N items from within the STL container itself. This is a semi-transparent replacement of the standard allocator: just replace std::map with ceph_map, for example. Limitations:
   a) Breaks move semantics.
   b) No deallocation implemented, so it is not suitable for big, long-living containers.

2. A placement allocator, which allows chained allocation of shorter-living objects from longer-living ones. An example would be allocating finish contexts from an aio completion context. Limitations:
   a) May require some code rearrangement to avoid concurrent deallocations; otherwise the deallocation code uses synchronization, which limits performance.
   b) Same as 1b above.

Performance results for 32 threads in a synthetic test, as the ratio of std allocator time to custom allocator time:

            stlalloc                      stl+placement alloc
  block     jemalloc  tcmalloc  ptmalloc  jemalloc  tcmalloc  ptmalloc
  1M        1298.01   650.66    137.64    735.49    824.45    9.62
  64K       514.84    2.82      304.62    570.74    4.85      12.21
  32K       838.89    2.17      5.03      1600.5    7.43      8.28
  4K        2.76      1.99      4.98      4.36      5.3       8.23
  32B       2.67      5.09      3.69      4.41      8.48      6.4

  (100M test iterations for 32B and 4K, 2M for 32K and 64K, 200K for 1M)

I didn't see any performance improvement in a 100% write fio test, but it may still shine in other workloads or with the proper classes replaced. Let me know if it is worth submitting PRs for them.
STL allocator:
https://github.com/efirs/ceph/commit/4eed0d63dbcbd00ee3aa325355bfbe56acbb7b05

STL allocator usage examples:
https://github.com/efirs/ceph/commit/362c5c4e10563785cc89370d28511e0493f1b211
https://github.com/efirs/ceph/commit/e2df67f7570c68e53775bc55cda12c6253e66d2f

Placement allocator:
https://github.com/efirs/ceph/commit/8df5cd7d753fd09e79a24f2fc781cf3af02e6d3e

Placement allocator usage example:
https://github.com/efirs/ceph/commit/70db18d9c1b39190bde68548b57c2aa7a9e455e0

-- Evgeniy
Re: Is BlueFS an alternative of BlueStore?
On Thu, 7 Jan 2016, Javen Wu wrote:
> Hi Sage,
>
> Sorry to bother you. I am not sure if it is appropriate to send email to you
> directly, but I cannot find any useful information to address my confusion
> on the Internet. Hope you can help me.
>
> Occasionally, I heard that you are going to start BlueFS to eliminate the
> redundancy between the XFS journal and the RocksDB WAL. I am a little
> confused. Is BlueFS only to host RocksDB for BlueStore, or is it an
> alternative to BlueStore?
>
> I am a newcomer to CEPH and I am not sure my understanding of BlueStore is
> correct. BlueStore in my mind is as below:
>
>        BlueStore
>        =========
>         RocksDB
>    +---------+   +------+
>    |  onode  |   |      |
>    |  WAL    |   |      |
>    |  omap   |   |      |
>    +---------+   | bdev |
>    |         |   |      |
>    |   XFS   |   |      |
>    |         |   |      |
>    +---------+   +------+

This is the picture before BlueFS enters the picture.

> I am curious whether BlueFS is able to host RocksDB; actually it's already a
> "filesystem" which has to maintain blockmap-kind metadata on its own
> WITHOUT the help of RocksDB.

Right. BlueFS is a really simple "file system" that is *just* complicated enough to implement the rocksdb::Env interface, which is what rocksdb needs to store its log and sst files. The after picture looks like

   +--------------+
   |  bluestore   |
   | +----------+ |
   | | rocksdb  | |
   | +----------+ |
   | |  bluefs  | |
   +-+----------+-+
   | block device |
   +--------------+

> The reason we care about the intention and the design target of BlueFS is
> that I had a discussion with my partner Peng.Hse about an idea to introduce
> a new ObjectStore using the ZFS library. I know CEPH supports ZFS as a
> FileStore backend already, but we had a different, immature idea to use
> libzpool to implement a new ObjectStore for CEPH totally in userspace
> without the SPL and ZOL kernel modules, so that we can align CEPH
> transactions and ZFS transactions in order to avoid the double write for
> the CEPH journal. The ZFS core part, libzpool (DMU, metaslab etc.), offers
> a dnode object store and it's platform (kernel/user) independent.
> Another benefit of the idea is that we can extend our metadata without
> bothering any DBStore.
>
> Frankly, we are not sure if our idea is realistic so far, but when I heard
> of BlueFS, I thought we needed to know the BlueFS design goal.

I think it makes a lot of sense, but there are a few challenges. One reason we use rocksdb (or a similar kv store) is that we need in-order enumeration of objects in order to do collection listing (needed for backfill, scrub, and omap). You'll need something similar on top of zfs.

I suspect the simplest path would be to also implement the rocksdb::Env interface on top of the zfs libraries. See BlueRocksEnv.{cc,h} to see the interface that has to be implemented...

sage
Re: Is BlueFS an alternative of BlueStore?
Thanks Sage for your reply.

I am not sure I understand the challenges you mentioned about backfill/scrub. I will investigate the code and let you know if we can conquer the challenge by easy means.

Our rough ideas for ZFSStore are:
1. Encapsulate the dnode object as an onode and add onode attributes.
2. Use a ZAP object as a collection. (A ZFS directory uses a ZAP object.)
3. Enumerate entries in the ZAP object to list objects in a collection.
4. Create a new metaslab class to store the CEPH journal.
5. Align the CEPH journal and the ZFS transaction.

Actually we've talked about the possibility of building RocksDB::Env on top of the zfs libraries. It must align the ZIL (ZFS intent log) and the RocksDB WAL; otherwise, there is still the same problem as with XFS and RocksDB.

ZFS is a tree-style, log-structure-like file system: once a leaf block updates, the modification is propagated from the leaf to the root of the tree. To batch writes and reduce the number of disk writes, ZFS persists modifications to disk in 5-second transactions. Only when an fsync/sync write arrives in the middle of the 5 seconds does ZFS persist the journal to the ZIL. I remember RocksDB does a sync after adding a log record, so if we cannot align the ZIL and the WAL, the log write would go to the ZIL first, then the ZIL would be applied to the log file, and finally RocksDB would update the sst file. It's almost the same problem as with XFS, if my understanding is correct. In my mind, aligning the ZIL and the WAL needs more modifications in RocksDB.

Thanks
Javen

On 2016年01月07日 22:37, peng.hse wrote:
> Hi Sage,
>
> thanks for your quick response. Javen and I, once ZFS developers, are
> currently focusing on how to leverage some of the ZFS ideas to improve the
> Ceph backend performance in userspace.
>
> Based on your encouraging reply, we came up with 2 schemes to continue our
> future work:
>
> 1. Scheme one: use an entirely new FS to replace rocksdb+bluefs. The FS
> itself handles the mapping of oid->fs-object (kind of a ZFS dnode) and the
> corresponding attrs used by Ceph, despite the implementation challenges you
> mentioned about the in-order enumeration of objects during backfill, scrub,
> etc. (the same situation we also confronted in ZFS; the ZAP features helped
> us a lot). From a performance or architecture point of view, it looks more
> clear and clean. Would you suggest we give it a try?
>
> 2. Scheme two: as you suspected, temporarily implement a simple version of
> the FS which leverages libzpool ideas to plug into rocksdb underneath, as
> your bluefs did.
>
> Appreciate your insightful reply. Thanks
Re: Is BlueFS an alternative of BlueStore?
Hi Sage,

thanks for your quick response. Javen and I, once ZFS developers, are currently focusing on how to leverage some of the ZFS ideas to improve the Ceph backend performance in userspace.

Based on your encouraging reply, we came up with 2 schemes to continue our future work:

1. Scheme one: use an entirely new FS to replace rocksdb+bluefs. The FS itself handles the mapping of oid->fs-object (kind of a ZFS dnode) and the corresponding attrs used by Ceph, despite the implementation challenges you mentioned about the in-order enumeration of objects during backfill, scrub, etc. (the same situation we also confronted in ZFS; the ZAP features helped us a lot). From a performance or architecture point of view, it looks more clear and clean. Would you suggest we give it a try?

2. Scheme two: as you suspected, temporarily implement a simple version of the FS which leverages libzpool ideas to plug into rocksdb underneath, as your bluefs did.

Appreciate your insightful reply. Thanks
two tarballs for ceph 10.0.1
In http://download.ceph.com/tarballs/ there are two tarballs: "ceph_10.0.1.orig.tar.gz" and "ceph_10.0.1.orig.tar.gz.1". Which one is correct? Can we delete one?

- Ken
Re: FreeBSD Building and Testing
On 6-1-2016 08:51, Mykola Golub wrote:
> On Mon, Dec 28, 2015 at 05:53:04PM +0100, Willem Jan Withagen wrote:
>> Hi,
>>
>> Can somebody try to help me and explain why the test:
>>   Func: test/mon/osd-crush
>>   Func: TEST_crush_reject_empty started
>> fails with a python error which sort of startles me:
>>
>> test/mon/osd-crush.sh:227: TEST_crush_reject_empty: local empty_map=testdir/osd-crush/empty_map
>> test/mon/osd-crush.sh:228: TEST_crush_reject_empty: :
>> test/mon/osd-crush.sh:229: TEST_crush_reject_empty: ./crushtool -c testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.map
>> test/mon/osd-crush.sh:230: TEST_crush_reject_empty: expect_failure testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map
>> ../qa/workunits/ceph-helpers.sh:1171: expect_failure: local dir=testdir/osd-crush
>> ../qa/workunits/ceph-helpers.sh:1172: expect_failure: shift
>> ../qa/workunits/ceph-helpers.sh:1173: expect_failure: local 'expected=Error EINVAL'
>> ../qa/workunits/ceph-helpers.sh:1174: expect_failure: shift
>> ../qa/workunits/ceph-helpers.sh:1175: expect_failure: local success
>> ../qa/workunits/ceph-helpers.sh:1176: expect_failure: pwd
>> ../qa/workunits/ceph-helpers.sh:1177: expect_failure: printenv
>> ../qa/workunits/ceph-helpers.sh:1178: expect_failure: echo ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map
>> ../qa/workunits/ceph-helpers.sh:1180: expect_failure: ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map
>> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
>> Traceback (most recent call last):
>>   File "./ceph", line 936, in
>>     retval = main()
>>   File "./ceph", line 874, in main
>>     sigdict, inbuf, verbose)
>>   File "./ceph", line 457, in new_style_command
>>     inbuf=inbuf)
>>   File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py", line 1208, in json_command
>>     raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
>> RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd setcrushmap"}']": exception 'utf8' codec can't decode byte 0x86 in position 56: invalid start byte
>>
>> Which is certainly not the type of error expected. But it is hard to detect any 0x86 in the arguments.
>
> Are you able to reproduce this problem manually? I.e. in the src dir, start the cluster using vstart.sh:
>
>   ./vstart.sh -n
>
> Check it is running:
>
>   ./ceph -s
>
> Repeat the test:
>
>   truncate -s 0 empty_map.txt
>   ./crushtool -c empty_map.txt -o empty_map.map
>   ./ceph osd setcrushmap -i empty_map.map
>
> Expected output:
>
>   "Error EINVAL: Failed crushmap test: ./crushtool: exit status: 1"

Hi all,

I've spent the Xmas days trying to learn more about Python. (And catching up with old friends :)) My heritage is the days of assembler, shell scripts, C, Perl and the like. So the pony had to learn a few new tricks (aka languages). I'm now trying to get python nosetests to actually work. In the meantime I also found that FreeBSD has patches for Googletest to actually make most of the DEATH tests work.

I think this python stream parse error got resolved by upgrading everything built, including the complete package environment, and upgrading kernel and tools... :) Which I think cleaned out the python environment, which was a bit mixed up with different versions. Now test/mon/osd-crush.sh returns OK, so I guess the setup of the environment is relatively critical.

I also noted that some of the tests get more tests done if I run them under root privileges.

The last test run resulted in:

  ceph 10.0.1: src/test-suite.log
  # TOTAL: 120
  # PASS:  110
  # SKIP:  0
  # XFAIL: 0
  # FAIL:  10
  # XPASS: 0
  # ERROR: 0

  FAIL ceph-detect-init/run-tox.sh (exit status: 1)
  FAIL test/run-rbd-unit-tests.sh (exit status: 138)
  FAIL test/ceph_objectstore_tool.py (exit status: 1)
  FAIL test/cephtool-test-mon.sh (exit status: 1)
  FAIL test/cephtool-test-rados.sh (exit status: 1)
  FAIL test/libradosstriper/rados-striper.sh (exit status: 1)
  FAIL test/test_objectstore_memstore.sh (exit status: 127)
  FAIL test/ceph-disk.sh (exit status: 1)
  FAIL test/pybind/test_ceph_argparse.py (exit status: 127)
  FAIL test/pybind/test_ceph_daemon.py (exit status: 127)

The first and last two actually don't work because of python things that are not working on FreeBSD and that I have to sort out:

  ceph_detect_init.exc.UnsupportedPlatform: Platform is not supported.:
  ../test-driver: ./test/pybind/test_ceph_argparse.py: not found
  FAIL test/pybind/test_ceph_argparse.py (exit status: 127)

I also have:

  ./test/test_objectstore_memstore.sh: ./ceph_test_objectstore: not found
  FAIL test/test_objectstore_memstore.sh (exit status: 127)

which is a weird one that needs some TLC.

So I'm slowly getting there...

--WjW
Re: FreeBSD Building and Testing
On 5-1-2016 19:23, Gregory Farnum wrote:
> On Mon, Dec 28, 2015 at 8:53 AM, Willem Jan Withagen wrote:
>> Hi,
>>
>> Can somebody try to help me and explain why the test:
>>   Func: test/mon/osd-crush
>>   Func: TEST_crush_reject_empty started
>> fails with a python error which sort of startles me:
>>
>> [...]
>>
>> RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd setcrushmap"}']": exception 'utf8' codec can't decode byte 0x86 in position 56: invalid start byte
>>
>> Which is certainly not the type of error expected. But it is hard to detect any 0x86 in the arguments. And yes, python is right: there are no UTF8 sequences that start with 0x86.
>> Question is: why does it want to parse with UTF8? And how do I switch it off? Or how do I fix this error?
>
> I've not handled this myself but we've seen this a few times. The latest example in a quick email search was http://tracker.ceph.com/issues/9405, and it was apparently a string which wasn't null-terminated.

Looks like in my case it was due to too large a mess in the python environment. But I'll keep this in mind, if it comes back to haunt me more.

Thanx,
--WjW
Stable releases preparation temporarily stalled
Hi,

The stable releases (hammer, infernalis) did not make progress in the past few weeks because we can't run tests. Before xmas the following happened:

* the sepia lab was migrated and we discovered the OpenStack teuthology backend can't run without it (that was a problem during a few days only)
* there are OpenStack-specific failures in each teuthology suite and it is non-trivial to separate them from genuine backport errors
* the make check bot went down (it was partially running on my private hardware)

If we just wait, I'm not sure when we will be able to resume our work because:

* the sepia lab is back but has less horsepower than it did
* not all of us have access to the sepia lab
* the make check bot is being worked on by the infrastructure team, but it is low priority and it may take weeks before it's back online
* the ceph-qa-suite errors that are OpenStack-specific are low priority and may never be fixed

I think we should rely on the sepia lab for testing for the foreseeable future and wait for the make check bot to be back. Tests will take a long time to run, but we've been able to work with a one-week delay before, so it's not a blocker. Although fixing the OpenStack-specific errors would allow us to use the teuthology OpenStack backend (I will fix the last error left in the rados suite), it is unrealistic to set that as a requirement to run tests: we have neither the workforce nor the skills to do that. Hopefully, some time in the future, Ceph developers will use ceph-qa-suite on OpenStack as part of the development workflow. But right now running ceph-qa-suite on OpenStack is outside of the development workflow and in a state of continuous regression, which is inconvenient for us because we need something stable to compare the runs from the integration branch against.

Fixing the make check bot is a two-part problem. Each failed run must be looked at to chase false negatives (continuous integration with false negatives is a plague), which I did in the past year on a daily basis and am happy to keep doing. Before the xmas break, the bot running at jenkins.ceph.com sent over 90% false negatives, primarily because it was trying to run on unsupported operating systems, and it was stopped until this is fixed. It also appears that the machine running the bot is not re-imaged after each test, meaning a bogus run may taint all future tests and create a continuous flow of false negatives. Addressing these two issues requires knowing or learning about the Ceph jenkins setup and slave provisioning. This probably is a few days of work, which is why the infrastructure team can't resolve it immediately.

If you have alternative creative ideas on how to improve the current situation, please speak up :-)

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
Re: FreeBSD Building and Testing
On 6-1-2016 08:51, Mykola Golub wrote:
> Are you able to reproduce this problem manually? I.e. in the src dir, start the cluster using vstart.sh:
>
>   ./vstart.sh -n
>
> Check it is running:
>
>   ./ceph -s
>
> Repeat the test:
>
>   truncate -s 0 empty_map.txt
>   ./crushtool -c empty_map.txt -o empty_map.map
>   ./ceph osd setcrushmap -i empty_map.map
>
> Expected output:
>
>   "Error EINVAL: Failed crushmap test: ./crushtool: exit status: 1"

OK, thanx. Nice to have some of these examples...

--WjW
01/06/2016 Weekly Ceph Performance Meeting IS ON!
8AM PST as usual (i.e. in 18 minutes)! Discussion topics today include bluestore testing results and a potential performance regression in CentOS/RHEL 7.1 kernels. Please feel free to add your own topics!

Here are the links:

Etherpad URL: http://pad.ceph.com/p/performance_weekly

To join the meeting: https://bluejeans.com/268261044
To join via browser: https://bluejeans.com/268261044/browser
To join with Lync: https://bluejeans.com/268261044/lync

To join via room system:
  Video conferencing system: bjn.vc -or- 199.48.152.152
  Meeting ID: 268261044

To join via phone:
  1) Dial: +1 408 740 7256, +1 888 240 2560 (US Toll Free), or +1 408 317 9253 (Alternate Number) (see all numbers - http://bluejeans.com/numbers)
  2) Enter Conference ID: 268261044

Mark
Is BlueFS an alternative of BlueStore?
Hi Sage, Sorry to bother you. I am not sure if it is appropriate to send email to you directly, but I cannot find any useful information to address my confusion from Internet. Hope you can help me. Occasionally, I heard that you are going to start BlueFS to eliminate the redudancy between XFS journal and RocksDB WAL. I am a little confused. Is the Bluefs only to host RocksDB for BlueStore or it's an alternative of BlueStore? I am a new comer to CEPH, I am not sure my understanding is correct about BlueStore. BlueStore in my mind is as below. BlueStore = RocksDB +---+ +---+ | onode | | | |WAL| | | | omap| | | +---+ | bdev| | | | | | XFS | | | | | | | +---+ +---+ I am curious if BlueFS is able to host RocksDB, actually it's already a "filesystem" which have to maintain blockmap kind of metadata by its own WITHOUT the help of RocksDB. When BlueFS is introduced into the picture, why RocksDB is needed yet? So I guess BlueFS is an alternative of BlueStore and it's a new ObjectStore without leveraging RocksDB. Is my understanding correct? The reason we care the intention and the design target of BlueFS is that I had discussion with my partner Peng.Hse about an idea to introduce a new ObjectStore using ZFS library. I know CEPH supports ZFS as FileStore backend already, but we had a different immature idea to use libzpool to implement a new ObjectStore for CEPH totally in userspace without SPL and ZOL kernel module. So that we can align CEPH transaction and zfs transaction in order to avoid double write for CEPH journal. ZFS core part libzpool (DMU, metaslab etc) offers a dnode object store and it's platform kernel/user independent. Another benefit for the idea is we can extend our metadata without bothering any DBStore. Frankly, we are not sure if our idea is realistic so far, but when I heard of BlueFS, I think we need to know the BlueFS design goal. 
Thanks,
Javen
Re: 01/06/2016 Weekly Ceph Performance Meeting IS ON!
The last recording I'm seeing is for 10/07/15. Can we get the newer ones?

Thanks,
- Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Wed, Jan 6, 2016 at 8:43 AM, Mark Nelson wrote:
> 8AM PST as usual (ie in 18 minutes)! Discussion topics today include
> bluestore testing results and a potential performance regression in
> CentOS/RHEL 7.1 kernels. Please feel free to add your own topics!
>
> Here's the links:
>
> Etherpad URL:
> http://pad.ceph.com/p/performance_weekly
>
> To join the Meeting:
> https://bluejeans.com/268261044
> [snip]
Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy
This is odd. We are signing all packages before publishing them on the repository. These ceph-deploy releases are following a new release process, so I will have to investigate where the disconnect is.

Thanks for letting us know.

On Tue, Jan 5, 2016 at 10:31 AM, Derek Yarnell wrote:
> It looks like the ceph-deploy > 1.5.28 packages in the
> http://download.ceph.com/rpm-hammer/el6 and
> http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
> signed. What happened? This is causing our yum updates to fail but may
> be a sign of something much more nefarious?
> [snip]
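Derek's per-package check is easy to wrap in a small script. This is only a sketch, not an official tool: `check_sig` is a hypothetical helper name, and in real use you would feed it the output of `rpm -qp --queryformat %{SIGPGP} <url>` for each package in the repo:

```shell
#!/bin/sh
# Classify packages by whether rpm reported a PGP signature header.
# check_sig PKG SIG -- SIG is the %{SIGPGP} query result; rpm prints
# "(none)" for unsigned packages and a long hex blob for signed ones.
check_sig() {
    pkg=$1
    sig=$2
    if [ "$sig" = "(none)" ]; then
        echo "UNSIGNED: $pkg"
        return 1
    fi
    echo "signed: $pkg"
}
```

Looping over the repo URLs with `check_sig "$url" "$(rpm -qp --queryformat %{SIGPGP} "$url")"` would have flagged the 1.5.29 and 1.5.30 packages immediately.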
Re: FreeBSD Building and Testing
On Mon, Dec 28, 2015 at 8:53 AM, Willem Jan Withagen wrote:
> Hi,
>
> Can somebody try to help me and explain why
>
> in test: Func: test/mon/osd-crash
> Func: TEST_crush_reject_empty started
>
> Fails with a python error which sort of startles me:
> test/mon/osd-crush.sh:227: TEST_crush_reject_empty: local empty_map=testdir/osd-crush/empty_map
> test/mon/osd-crush.sh:228: TEST_crush_reject_empty: :
> test/mon/osd-crush.sh:229: TEST_crush_reject_empty: ./crushtool -c testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.map
> test/mon/osd-crush.sh:230: TEST_crush_reject_empty: expect_failure testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map
> ../qa/workunits/ceph-helpers.sh:1171: expect_failure: local dir=testdir/osd-crush
> ../qa/workunits/ceph-helpers.sh:1172: expect_failure: shift
> ../qa/workunits/ceph-helpers.sh:1173: expect_failure: local 'expected=Error EINVAL'
> ../qa/workunits/ceph-helpers.sh:1174: expect_failure: shift
> ../qa/workunits/ceph-helpers.sh:1175: expect_failure: local success
> ../qa/workunits/ceph-helpers.sh:1176: expect_failure: pwd
> ../qa/workunits/ceph-helpers.sh:1177: expect_failure: printenv
> ../qa/workunits/ceph-helpers.sh:1178: expect_failure: echo ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map
> ../qa/workunits/ceph-helpers.sh:1180: expect_failure: ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map
> *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
> Traceback (most recent call last):
>   File "./ceph", line 936, in <module>
>     retval = main()
>   File "./ceph", line 874, in main
>     sigdict, inbuf, verbose)
>   File "./ceph", line 457, in new_style_command
>     inbuf=inbuf)
>   File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py", line 1208, in json_command
>     raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
> RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd setcrushmap"}']": exception 'utf8' codec can't decode byte 0x86 in position 56: invalid start byte
>
> Which is certainly not the type of error expected.
> But it is hard to detect any 0x86 in the arguments.
>
> And yes, python is right: there are no UTF-8 sequences that start with 0x86.
> Question is:
> Why does it want to parse with UTF8?
> And how do I switch it off?
> Or how do I fix this error?

I've not handled this myself but we've seen this a few times. The latest example in a quick email search was http://tracker.ceph.com/issues/9405, and it was apparently caused by a string which wasn't null-terminated.
-Greg
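For reference, the `expect_failure` helper being traced above can be approximated like this. It is a simplified sketch of the idea, not the actual code from qa/workunits/ceph-helpers.sh (the real helper also takes a test-dir argument and dumps pwd and the environment for debugging):

```shell
#!/bin/sh
# expect_failure EXPECTED CMD... -- succeed only if CMD fails *and* its
# combined stdout/stderr contains the string EXPECTED.
expect_failure() {
    expected=$1
    shift
    if out=$("$@" 2>&1); then
        echo "expected failure, but command succeeded"
        return 1
    fi
    case "$out" in
        *"$expected"*) return 0 ;;
        *) echo "command failed without '$expected': $out"; return 1 ;;
    esac
}
```

So TEST_crush_reject_empty only passes when `./ceph osd setcrushmap` both exits non-zero and prints "Error EINVAL"; the UTF-8 traceback makes the command fail for the wrong reason with the wrong output, which is why the test reports a failure.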
CBT on an existing cluster
Having trouble getting a reply from c...@cbt.com so trying the ceph-devel list...

To get familiar with CBT, I first wanted to use it on an existing cluster (i.e., not have CBT do any cluster setup).

Is there a .yaml example that illustrates how to use cbt to run, for example, its radosbench benchmark on an existing cluster?

-- Tom Deneau, AMD
Re: CBT on an existing cluster
On Tue, Jan 5, 2016 at 9:56 AM, Deneau, Tom wrote:
> Having trouble getting a reply from c...@cbt.com so trying ceph-devel list...
>
> To get familiar with CBT, I first wanted to use it on an existing cluster.
> (i.e., not have CBT do any cluster setup).
>
> Is there a .yaml example that illustrates how to use cbt to run for example,
> its radosbench benchmark on an existing cluster?

I dunno anything about CBT, but I don't see any emails from you on that list, and the correct address is c...@lists.ceph.com (rather than the other way around), so let's try that. :)
-Greg
PS: next reply drop ceph-devel, please!
Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy
It looks like this was only for ceph-deploy in Hammer. I verified that this wasn't the case in e.g. Infernalis.

I have ensured that the ceph-deploy packages in hammer are in fact signed and coming from our builds.

Thanks again for reporting this!

On Tue, Jan 5, 2016 at 12:27 PM, Alfredo Deza wrote:
> This is odd. We are signing all packages before publishing them on the
> repository. These ceph-deploy releases are following a new release
> process so I will have to investigate where is the disconnect.
>
> Thanks for letting us know.
>
> On Tue, Jan 5, 2016 at 10:31 AM, Derek Yarnell wrote:
>> It looks like the ceph-deploy > 1.5.28 packages in the
>> http://download.ceph.com/rpm-hammer/el6 and
>> http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP
>> signed. What happened? This is causing our yum updates to fail but may
>> be a sign of something much more nefarious?
>> [snip]
Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy
Hi Alfredo,

I am still having a bit of trouble, though, with what looks like the 1.5.31 release. With a `yum update ceph-deploy` I get the following even after a full `yum clean all`.

http://ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.31-0.noarch.rpm: [Errno -1] Package does not match intended download. Suggestion: run yum --enablerepo=Ceph-noarch clean metadata

Thanks,
derek

On 1/5/16 1:25 PM, Alfredo Deza wrote:
> It looks like this was only for ceph-deploy in Hammer. I verified that
> this wasn't the case in e.g. Infernalis
>
> I have ensured that the ceph-deploy packages in hammer are in fact
> signed and coming from our builds.
>
> Thanks again for reporting this!
> [snip]

--
Derek T. Yarnell
University of Maryland
Institute for Advanced Computer Studies
PGP signatures for RHEL hammer RPMs for ceph-deploy
It looks like the ceph-deploy > 1.5.28 packages in the http://download.ceph.com/rpm-hammer/el6 and http://download.ceph.com/rpm-hammer/el7 repositories are not being PGP signed. What happened? This is causing our yum updates to fail but may be a sign of something much more nefarious? # rpm -qp --queryformat %{SIGPGP} http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.28-0.noarch.rpm 89021c040001020006050255fae0d5000a0910e84ac2c0460f3994203610009e284c0c6749f9d1ccd54aca8668e5f4148eb60f0ade762a5cb316926060d73a82490c41b8a5e9a5ebb8a7136a5ce294565cf8548dce160f7a577b623f12fb841b1656fba0b139404b4a074c076abf8c38f176bbecfc551567d22826d6c3ac2a67d8c8f4db67e3a2566272f492f3a1461b2c80bfc56f0c29e3a0c0e03fe50ee877d2d2b99963ea876914f5d85ae6fcf60c7c372040fcc82591552af21e152a37ab4103c3116ccd3a5f10992dc9ec483922212ef8ad8c37abbb6a751f6da2cc79567ed45e7bcb83d92aecc2a61d7584699183622714376bf3766e8781c7675834cce7d3e6c349bee6992872248fe7dd9f00248806e0c99f1a7010a8e77d13fefffeb142c1ee4ee8e55e53043fb89b7127a1c2282f4ab0fa3d19eccaa38194aa42310860bdd7746de8512b106d7923e9da9d1ad84b4ba1f8a3175b808d08f99ca5b737d4a7cba1f165b815187bec9ff1e0b5627e435ed869ae0bb16419e928e1a64413bb4dd62a6b1b049faa02eaa14bd6636b5f835bfef16acfd2daad82c1fed57a5e635971281367d2fe99c3b2b542490559d9b9b3f4295c86185aa3c4b4014da55c1b0ff68bc42c869729fee29472c413c911ea9bc5d58957bfb670ddc54d28fd8f30444969b790e53f9d34a1b2df9b e2afe9d26 d5be57b9fcd659c4880fad613ba5f175e4e3466dba4919a4656ffd228688a9c81d865e6df870ba33bbfc000 # rpm -qp --queryformat %{SIGPGP} http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.29-0.noarch.rpm (none) # rpm -qp --queryformat %{SIGPGP} http://download.ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.30-0.noarch.rpm (none) # rpm -qp --queryformat %{SIGPGP} http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.28-0.noarch.rpm 
89021c040001020006050255fadf42000a0910e84ac2c0460f39943b131000cb7f253c91019b2f5993fd232c4369003d521538aa19f996717d2eee780fe2d7ed4e969418ce92d6ad4be69b3c5421b80d2241a9d6e72e758ba86f0360e24aadd63d89165b47a566bcd8bed39d7b37e809d7afdf6b38e5e014f98caca6df7da6278822e2457c627cdba505febc23edb32447e11c2878e79bf5f5690def708ed7d79d261a839d5808b177cb3d6a8bc62317441f3e1b5cf986aeb5cde98fc986c42af2761418e7e83309df9b8703648a8e6eefe83f9d3cbcfe371bc336320657f86343ab25df8bd578203b6f312746ebbe0da195adeb1087487d12d530281b5328731c54240b0c5c01f1648c8802231876a33a0835a553e1b84e6d8a15acdd5db6b6bf9c6dee84b22ae0e70dc0cf2acdd5779e510a248844bba0af87ae8d5a874502ec0e48b235926222cf3386c44e30e3af14dea6134a5873784013297fa19a09f439bc8a2b73f563fc6e5cfa60767629a37f3cd24762f7b14e5f7ce08adeed82da3effc59298359a9f7f0efab0e4e808a33ceb07431530e0c279462da043bbece02d3fdf6a96e5a813eea0bf0f73e84b7fac6e28449e1bf15ddc2fa692f641ce8d4d9ed4261ba2824adee47dad90993ebc46d6ee083e92c8f76aaf8428e274e48cb1a91d0a2eb15e8779289b3771ef71 1cd9cc7f2 8f7a3cde708e4577b0aad546024ee98646f4f543ee1e33d8c96a93cff9b48deefa5b3996f659b16786ff016 # rpm -qp --queryformat %{SIGPGP} http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.29-0.noarch.rpm (none) # rpm -qp --queryformat %{SIGPGP} http://download.ceph.com/rpm-hammer/el7/noarch/ceph-deploy-1.5.30-0.noarch.rpm (none) -- Derek T. Yarnell University of Maryland Institute for Advanced Computer Studies -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
deprecation and build warnings
I was annoyed again at our gitbuilders being all yellow because of compile warnings, so I went to check how many of them are real and how many are self-inflicted. I just spot-checked http://gitbuilder.sepia.ceph.com/gitbuilder-ceph-tarball-trusty-amd64-basic/log.cgi?log=2694e1171f23166e8a11c57c7b284621498decd8, but much to my pleasant surprise there are only two kinds of warnings:

1) we have 16 uses of rados_ioctx_pool_required_alignment, which is deprecated.
2) we have two uses of libec_isa.so being linked against a loadable module.

Both of these are contained entirely in our unit tests. I don't know exactly what's going on with the second one, but I imagine it's not a difficult fix? For the first one, can we just stop testing it? Or in some way suppress the warning for those callers?

I'd love to have some green show up on the dashboard again, so that it's not too hard to notice when we introduce actual build regressions. ;)
-Greg
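A quick way to triage a yellow builder log is to tally warnings by their diagnostic flag. Here is a small sketch, assuming GCC/Clang-style output where each warning line ends in a tag like `[-Wdeprecated-declarations]` (`warn_summary` is an ad-hoc name, not an existing ceph script):

```shell
#!/bin/sh
# warn_summary LOGFILE -- count build warnings grouped by their -W flag,
# most frequent first.
warn_summary() {
    grep -o '\[-W[a-z0-9=-]*\]' "$1" | sort | uniq -c | sort -rn
}
```

Against a log like the one spot-checked above, this would collapse the 16 deprecation warnings into a single bucket, so a newly introduced warning type stands out immediately.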
Docs now building again
https://github.com/ceph/ceph/pull/7119 fixed an issue preventing docs from building. Master is fixed; merge that into your branches if you want working docs again.

--
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs
Re: FreeBSD Building and Testing
On Mon, Dec 28, 2015 at 05:53:04PM +0100, Willem Jan Withagen wrote:
> Hi,
>
> Can somebody try to help me and explain why
>
> in test: Func: test/mon/osd-crash
> Func: TEST_crush_reject_empty started
>
> Fails with a python error which sort of startles me:
> [snip]
> RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd setcrushmap"}']": exception 'utf8' codec can't decode byte 0x86 in position 56: invalid start byte
>
> Which is certainly not the type of error expected.
> But it is hard to detect any 0x86 in the arguments.

Are you able to reproduce this problem manually? I.e. in the src dir, start the cluster using vstart.sh:

 ./vstart.sh -n

Check it is running:

 ./ceph -s

Repeat the test:

 truncate -s 0 empty_map.txt
 ./crushtool -c empty_map.txt -o empty_map.map
 ./ceph osd setcrushmap -i empty_map.map

Expected output: "Error EINVAL: Failed crushmap test: ./crushtool: exit status: 1"

--
Mykola Golub
Re: [ceph-users] PGP signatures for RHEL hammer RPMs for ceph-deploy
It seems that the metadata didn't get updated. I just tried it out and got the right version with no issues.

Hopefully *this* time it works for you. Sorry for all the trouble.

On Tue, Jan 5, 2016 at 3:21 PM, Derek Yarnell wrote:
> Hi Alfredo,
>
> I am still having a bit of trouble though with what looks like the
> 1.5.31 release. With a `yum update ceph-deploy` I get the following
> even after a full `yum clean all`.
>
> http://ceph.com/rpm-hammer/el6/noarch/ceph-deploy-1.5.31-0.noarch.rpm:
> [Errno -1] Package does not match intended download. Suggestion: run yum
> --enablerepo=Ceph-noarch clean metadata
>
> Thanks,
> derek
> [snip]
Re: Long peering - throttle at FileStore::queue_transactions
On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil wrote: > On Mon, 4 Jan 2016, Guang Yang wrote: >> Hi Cephers, >> Happy New Year! I got question regards to the long PG peering.. >> >> Over the last several days I have been looking into the *long peering* >> problem when we start a OSD / OSD host, what I observed was that the >> two peering working threads were throttled (stuck) when trying to >> queue new transactions (writing pg log), thus the peering process are >> dramatically slow down. >> >> The first question came to me was, what were the transactions in the >> queue? The major ones, as I saw, included: >> >> - The osd_map and incremental osd_map, this happens if the OSD had >> been down for a while (in a large cluster), or when the cluster got >> upgrade, which made the osd_map epoch the down OSD had, was far behind >> the latest osd_map epoch. During the OSD booting, it would need to >> persist all those osd_maps and generate lots of filestore transactions >> (linear with the epoch gap). > As the PG was not involved in most of those epochs, could we only take and >> > persist those osd_maps which matter to the PGs on the OSD? > > This part should happen before the OSD sends the MOSDBoot message, before > anyone knows it exists. There is a tunable threshold that controls how > recent the map has to be before the OSD tries to boot. If you're > seeing this in the real world, we probably just need to adjust that value > way down to something small(er). It queues the transactions and then sends out the MOSDBoot, so there is still a chance of contention with the peering ops (especially on large clusters where there is a lot of activity generating many osdmap epochs). Any chance we can change *queue_transactions* to *apply_transactions*, so that we block there waiting for the osdmap to be persisted? At least we may be able to do that during OSD booting?
The concern is, if the OSD is active, apply_transaction would take longer while holding the osd_lock.. I can't find such a tunable, could you elaborate? Thanks! > > sage > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
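The contention Guang describes can be sketched with a toy model (all names here are hypothetical; the real FileStore throttle counts ops and bytes, not a fixed-length queue): queue_transactions() returns as soon as the transaction is queued, but blocks the caller once the throttle is saturated, which is exactly how a pair of peering threads can stall behind a burst of osdmap and pg-log writes. apply_transactions() would instead make the caller wait for the commit itself.

```python
import queue

class FileStoreSim:
    """Toy model of a throttled objectstore transaction queue."""

    def __init__(self, throttle_max=4):
        # A bounded queue standing in for the ops/bytes throttle.
        self.q = queue.Queue(maxsize=throttle_max)

    def queue_transactions(self, txn):
        # Async path: returns once queued, but BLOCKS when the
        # throttle is full -- this is where peering threads stall.
        self.q.put(txn)

    def apply_transactions(self, txn):
        # Sync path: the caller waits for the commit itself,
        # so nothing piles up behind it.
        self._commit(txn)

    def drain_one(self):
        # Stand-in for a filestore op thread committing one txn.
        self._commit(self.q.get())
        self.q.task_done()

    def _commit(self, txn):
        pass  # pretend to persist to disk

fs = FileStoreSim(throttle_max=2)
fs.queue_transactions("osdmap epoch 100")
fs.queue_transactions("pg 1.0 log write")
# A third queue_transactions() call from a peering thread would now
# block until an op thread calls drain_one().
```

This is only a model of the blocking behaviour under discussion, not Ceph's code; the trade-off Guang raises (apply_transaction holding osd_lock longer on an active OSD) is not captured here.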
hammer mon failure
http://tracker.ceph.com/issues/14236 New hammer mon failure in the nightlies (missing a map apparently?), can you take a look? -Sam -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: hammer mon failure
On 01/05/2016 07:55 PM, Samuel Just wrote: > http://tracker.ceph.com/issues/14236 > > New hammer mon failure in the nightlies (missing a map apparently?), > can you take a look? > -Sam Will do. -Joao -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Is rbd map/unmap op. configured like an event?
Hi All, Is the rbd map/unmap operation configured as an event in /etc/init, so that we can use systemd/upstart to manage it automatically? - wukongming ID: 12019 Tel:0571-86760239 Dept:2014 UIS2 ONEStor - This e-mail and its attachments contain confidential information from H3C, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it!
Long peering - throttle at FileStore::queue_transactions
Hi Cephers, Happy New Year! I have a question regarding the long PG peering. Over the last several days I have been looking into the *long peering* problem when we start an OSD / OSD host; what I observed was that the two peering work threads were throttled (stuck) when trying to queue new transactions (writing the pg log), so the peering process was dramatically slowed down. The first question that came to me was: what are the transactions in the queue? The major ones, as I saw, included: - The osd_map and incremental osd_map. This happens if the OSD had been down for a while (in a large cluster), or when the cluster got upgraded, which left the osd_map epoch the down OSD had far behind the latest osd_map epoch. During OSD boot, it needs to persist all those osd_maps and generate lots of filestore transactions (linear in the epoch gap). > As the PG was not involved in most of those epochs, could we only take and > persist those osd_maps which matter to the PGs on the OSD? - There are lots of deletion transactions: as a PG boots, it needs to merge the PG log from its peers, and for a deletion PG log entry it queues the deletion transaction immediately. > Could we delay queueing these transactions until all PGs on the host are > peered? Thanks, Guang
Re: OSD data file are OSD logs
IIRC, you are running giant. I think that's the log rotate dangling fd bug (not fixed in giant since giant is eol). Fixed upstream 8778ab3a1ced7fab07662248af0c773df759653d, firefly backport is b8e3f6e190809febf80af66415862e7c7e415214. -Sam On Mon, Jan 4, 2016 at 3:37 PM, Guang Yangwrote: > Hi Cephers, > Before I open a tracker, I would like check if it is a known issue or not.. > > One one of our clusters, there was OSD crash during repairing, the > crash happened after we issued a PG repair for inconsistent PGs, which > failed because the recorded file size (within xattr) mismatched with > the actual file size. > > The mismatch was caused by the fact that the content of the data file > are OSD logs, following is from osd.354 on c003: > > -rw-r--r-- 1 yahoo root 75168 Jan 3 07:30 > default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7 > -bash-4.1$ head > "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7" > 2016-01-03 07:30:01.600119 7f7fe2096700 15 > filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs > 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7 > 2016-01-03 07:30:01.604967 7f7fe2096700 10 > filestore(/home/y/var/lib/ceph/osd/ceph-354) -ERANGE, len is 494 > 2016-01-03 07:30:01.604984 7f7fe2096700 10 > filestore(/home/y/var/lib/ceph/osd/ceph-354) -ERANGE, got 247 > 2016-01-03 07:30:01.604986 7f7fe2096700 20 > filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting > '_user.rgw.idtag' > 2016-01-03 07:30:01.604996 7f7fe2096700 20 > filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_' > 2016-01-03 07:30:01.605007 7f7fe2096700 20 > filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting > 'snapset' > 2016-01-03 07:30:01.605013 7f7fe2096700 20 > filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting > '_user.rgw.manifest' > 2016-01-03 07:30:01.605026 7f7fe2096700 20 > filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting > 
'hinfo_key' > 2016-01-03 07:30:01.605042 7f7fe2096700 20 > filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting > '_user.rgw.x-amz-meta-origin' > 2016-01-03 07:30:01.605049 7f7fe2096700 20 > filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting > '_user.rgw.acl' > > > This only happens on the clusters we turned on the verbose log > (debug_osd/filestore=20). And we are running ceph v0.87. > > Thanks, > Guang > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
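The failure mode Sam points at can be illustrated in miniature. The sketch below reproduces the fd-recycling bug class (it is an illustration only, not Ceph's actual code; the real fix is in the commits Sam cites): once log rotation closes the log fd out from under the logger, the kernel hands the same descriptor number to the next open(), and "log" writes land inside the data file.

```python
import os
import tempfile

d = tempfile.mkdtemp()

# The logger opens its log file and remembers the raw fd number.
log_fd = os.open(os.path.join(d, "osd.log"), os.O_WRONLY | os.O_CREAT, 0o644)
stale_fd = log_fd

# Log rotation closes the fd out from under the logger...
os.close(log_fd)

# ...and the next open() (here, an object data file) gets the lowest
# free descriptor number -- the one the logger still holds on to.
data_fd = os.open(os.path.join(d, "object.data"), os.O_WRONLY | os.O_CREAT, 0o644)
assert data_fd == stale_fd

# The logger keeps writing through the stale number, so a "log line"
# lands in the object data file -- matching the symptom in this thread.
os.write(stale_fd, b"2016-01-03 07:30:01 filestore(...) fgetattrs ...\n")
os.close(data_fd)

with open(os.path.join(d, "object.data"), "rb") as f:
    contents = f.read()
```

This also explains why it only shows up with verbose logging enabled: the more log writes in flight, the larger the window for one to hit a recycled descriptor.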
Re: OSD data file are OSD logs
Thanks Sam for the confirmation. Thanks, Guang On Mon, Jan 4, 2016 at 3:59 PM, Samuel Justwrote: > IIRC, you are running giant. I think that's the log rotate dangling > fd bug (not fixed in giant since giant is eol). Fixed upstream > 8778ab3a1ced7fab07662248af0c773df759653d, firefly backport is > b8e3f6e190809febf80af66415862e7c7e415214. > -Sam > > On Mon, Jan 4, 2016 at 3:37 PM, Guang Yang wrote: >> Hi Cephers, >> Before I open a tracker, I would like check if it is a known issue or not.. >> >> One one of our clusters, there was OSD crash during repairing, the >> crash happened after we issued a PG repair for inconsistent PGs, which >> failed because the recorded file size (within xattr) mismatched with >> the actual file size. >> >> The mismatch was caused by the fact that the content of the data file >> are OSD logs, following is from osd.354 on c003: >> >> -rw-r--r-- 1 yahoo root 75168 Jan 3 07:30 >> default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7 >> -bash-4.1$ head >> "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7" >> 2016-01-03 07:30:01.600119 7f7fe2096700 15 >> filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs >> 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7 >> 2016-01-03 07:30:01.604967 7f7fe2096700 10 >> filestore(/home/y/var/lib/ceph/osd/ceph-354) -ERANGE, len is 494 >> 2016-01-03 07:30:01.604984 7f7fe2096700 10 >> filestore(/home/y/var/lib/ceph/osd/ceph-354) -ERANGE, got 247 >> 2016-01-03 07:30:01.604986 7f7fe2096700 20 >> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting >> '_user.rgw.idtag' >> 2016-01-03 07:30:01.604996 7f7fe2096700 20 >> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_' >> 2016-01-03 07:30:01.605007 7f7fe2096700 20 >> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting >> 'snapset' >> 2016-01-03 07:30:01.605013 7f7fe2096700 20 >> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 
getting >> '_user.rgw.manifest' >> 2016-01-03 07:30:01.605026 7f7fe2096700 20 >> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting >> 'hinfo_key' >> 2016-01-03 07:30:01.605042 7f7fe2096700 20 >> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting >> '_user.rgw.x-amz-meta-origin' >> 2016-01-03 07:30:01.605049 7f7fe2096700 20 >> filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting >> '_user.rgw.acl' >> >> >> This only happens on the clusters we turned on the verbose log >> (debug_osd/filestore=20). And we are running ceph v0.87. >> >> Thanks, >> Guang >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
OSD data file are OSD logs
Hi Cephers, Before I open a tracker, I would like to check whether this is a known issue. On one of our clusters, an OSD crashed during repair; the crash happened after we issued a PG repair for inconsistent PGs, which failed because the recorded file size (in the xattr) mismatched the actual file size. The mismatch was caused by the fact that the contents of the data file are OSD logs; the following is from osd.354 on c003: -rw-r--r-- 1 yahoo root 75168 Jan 3 07:30 default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7 -bash-4.1$ head "default.12061.9\u8396947527\u52ac8b3ec6\uo.jpg__head_A2478171__3__7" 2016-01-03 07:30:01.600119 7f7fe2096700 15 filestore(/home/y/var/lib/ceph/osd/ceph-354) getattrs 3.171s7_head/a2478171/default.12061.9_8396947527_52ac8b3ec6_o.jpg/head//3/18446744073709551615/7 2016-01-03 07:30:01.604967 7f7fe2096700 10 filestore(/home/y/var/lib/ceph/osd/ceph-354) -ERANGE, len is 494 2016-01-03 07:30:01.604984 7f7fe2096700 10 filestore(/home/y/var/lib/ceph/osd/ceph-354) -ERANGE, got 247 2016-01-03 07:30:01.604986 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.idtag' 2016-01-03 07:30:01.604996 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_' 2016-01-03 07:30:01.605007 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting 'snapset' 2016-01-03 07:30:01.605013 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.manifest' 2016-01-03 07:30:01.605026 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting 'hinfo_key' 2016-01-03 07:30:01.605042 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.x-amz-meta-origin' 2016-01-03 07:30:01.605049 7f7fe2096700 20 filestore(/home/y/var/lib/ceph/osd/ceph-354) fgetattrs 61 getting '_user.rgw.acl' This only happens on clusters where we turned on verbose logging (debug_osd/filestore=20). And we are running ceph v0.87. Thanks, Guang
Re: Long peering - throttle at FileStore::queue_transactions
We need every OSDMap persisted before persisting later ones because we rely on there being no holes for a bunch of reasons. The deletion transactions are more interesting. It's not part of the boot process, these are deletions resulting from merging in a log from a peer which logically removed an object. It's more noticeable on boot because all PGs will see these operations at once (if there are a bunch of deletes happening). We need to process these transactions before we can serve reads (before we activate) currently since we use the on disk state (modulo the objectcontext locks) as authoritative. That transaction iirc also contains the updated PGLog. We can't avoid writing down the PGLog prior to activation, but we *can* delay the deletes (and even batch/throttle them) if we do some work: 1) During activation, we need to maintain a set of to-be-deleted objects. For each of these objects, we need to populate the objectcontext cache with an exists=false objectcontext so that we don't erroneously read the deleted data. Each of the entries in the to-be-deleted object set would have a reference to the context to keep it alive until the deletion is processed. 2) Any write operation which references one of these objects needs to be preceded by a delete if one has not yet been queued (and the to-be-deleted set updated appropriately). The tricky part is that the primary and replicas may have different objects in this set... The replica would have to insert deletes ahead of any subop (or the ec equilivant) it gets from the primary. For that to work, it needs to have something like the obc cache. I have a wip-replica-read branch which refactors object locking to allow the replica to maintain locks (to avoid replica-reads conflicting with writes). That machinery would probably be the right place to put it. 3) We need to make sure that if a node restarts anywhere in this process that it correctly repopulates the set of to be deleted entries. 
We might consider a deleted-to version in the log? Not sure about this one since it would be different on the replica and the primary. Anyway, it's actually more complicated than you'd expect and will require more design (and probably depends on wip-replica-read landing). -Sam On Mon, Jan 4, 2016 at 3:32 PM, Guang Yangwrote: > Hi Cephers, > Happy New Year! I got question regards to the long PG peering.. > > Over the last several days I have been looking into the *long peering* > problem when we start a OSD / OSD host, what I observed was that the > two peering working threads were throttled (stuck) when trying to > queue new transactions (writing pg log), thus the peering process are > dramatically slow down. > > The first question came to me was, what were the transactions in the > queue? The major ones, as I saw, included: > > - The osd_map and incremental osd_map, this happens if the OSD had > been down for a while (in a large cluster), or when the cluster got > upgrade, which made the osd_map epoch the down OSD had, was far behind > the latest osd_map epoch. During the OSD booting, it would need to > persist all those osd_maps and generate lots of filestore transactions > (linear with the epoch gap). >> As the PG was not involved in most of those epochs, could we only take and >> persist those osd_maps which matter to the PGs on the OSD? > > - There are lots of deletion transactions, and as the PG booting, it > needs to merge the PG log from its peers, and for the deletion PG > entry, it would need to queue the deletion transaction immediately. >> Could we delay the queue of the transactions until all PGs on the host are >> peered? 
> > Thanks, > Guang > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
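The bookkeeping in Sam's steps 1 and 2 can be condensed into a small sketch (all names here are hypothetical, and real object-context handling is far more involved; this only shows the invariant): during activation, objects the merged log says are deleted get an exists=false context in the obc cache, so reads treat them as gone even though the on-disk delete has not been queued yet, and a later write must queue the delete first.

```python
class ObjectContext:
    def __init__(self, oid, exists):
        self.oid = oid
        self.exists = exists

class PG:
    def __init__(self):
        self.obc_cache = {}       # oid -> ObjectContext
        self.to_be_deleted = {}   # oid -> ObjectContext (ref keeps the obc alive)
        self.txn_queue = []       # transactions handed to the objectstore

    def activate(self, logged_deletes):
        # Step 1: populate exists=False contexts for logically
        # deleted objects whose on-disk delete is being delayed.
        for oid in logged_deletes:
            obc = ObjectContext(oid, exists=False)
            self.obc_cache[oid] = obc
            self.to_be_deleted[oid] = obc

    def read(self, oid):
        obc = self.obc_cache.get(oid)
        if obc is not None and not obc.exists:
            return None  # object is logically gone; don't read stale disk state
        return "on-disk data"  # placeholder for a real objectstore read

    def write(self, oid, data):
        # Step 2: a write touching a to-be-deleted object must be
        # preceded by the delayed delete.
        if oid in self.to_be_deleted:
            self.txn_queue.append(("delete", oid))
            del self.to_be_deleted[oid]
        self.txn_queue.append(("write", oid, data))
        self.obc_cache[oid] = ObjectContext(oid, exists=True)

pg = PG()
pg.activate(["obj_a"])
```

The hard parts Sam lists, replicas maintaining the same set ahead of subops and repopulating it after a restart, are deliberately not modelled here.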
Re: Long peering - throttle at FileStore::queue_transactions
On Mon, 4 Jan 2016, Guang Yang wrote: > Hi Cephers, > Happy New Year! I got question regards to the long PG peering.. > > Over the last several days I have been looking into the *long peering* > problem when we start a OSD / OSD host, what I observed was that the > two peering working threads were throttled (stuck) when trying to > queue new transactions (writing pg log), thus the peering process are > dramatically slow down. > > The first question came to me was, what were the transactions in the > queue? The major ones, as I saw, included: > > - The osd_map and incremental osd_map, this happens if the OSD had > been down for a while (in a large cluster), or when the cluster got > upgrade, which made the osd_map epoch the down OSD had, was far behind > the latest osd_map epoch. During the OSD booting, it would need to > persist all those osd_maps and generate lots of filestore transactions > (linear with the epoch gap). > > As the PG was not involved in most of those epochs, could we only take and > > persist those osd_maps which matter to the PGs on the OSD? This part should happen before the OSD sends the MOSDBoot message, before anyone knows it exists. There is a tunable threshold that controls how recent the map has to be before the OSD tries to boot. If you're seeing this in the real world, we probably just need to adjust that value way down to something small(er). sage
Re: Speeding up rbd_stat() in libvirt
Short term, assuming there wouldn't be an objection from the libvirt community, I think spawning a thread pool and concurrently executing several rbd_stat calls concurrently would be the easiest and cleanest solution. I wouldn't suggest trying to roll your own solution for retrieving image sizes for format 1 and 2 RBD images directly within libvirt. Longer term, given this use case, perhaps it would make sense to add an async version of rbd_open. The rbd_stat call itself just reads the data from memory initialized by rbd_open. On the Jewel branch, librbd has had some major rework and image loading is asynchronous under the hood already. -- Jason Dillaman - Original Message - > From: "Wido den Hollander"> To: ceph-devel@vger.kernel.org > Sent: Monday, December 28, 2015 8:48:40 AM > Subject: Speeding up rbd_stat() in libvirt > > Hi, > > The storage pools of libvirt know a mechanism called 'refresh' which > will scan a storage pool to refresh the contents. > > The current implementation does: > * List all images via rbd_list() > * Call rbd_stat() on each image > > Source: > http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=cdbfdee98505492407669130712046783223c3cf;hb=master#l329 > > This works, but a RBD pool with 10k images takes a couple of minutes to > scan. > > Now, Ceph is distributed, so this could be done in parallel, but before > I start on this I was wondering if somebody had a good idea to fix this? > > I don't know if it is allowed in libvirt to spawn multiple threads and > have workers do this, but it was something which came to mind. > > libvirt only wants to know the size of a image and this is now stored in > the rbd_directory object, so the rbd_stat() is required. > > Suggestions or ideas? I would like to have this process to be as fast as > possible. 
> > Wido > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
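Jason's short-term suggestion can be sketched with a stdlib thread pool. In this sketch stat_image is a stand-in for the per-image rbd_open/rbd_stat/rbd_close sequence (in libvirt this would be C code over its own worker threads, not Python); the point is only that the per-image round trips run concurrently instead of serially.

```python
from concurrent.futures import ThreadPoolExecutor

def refresh_pool(image_names, stat_image, workers=16):
    """Return {image_name: size} by statting images concurrently.

    stat_image is a caller-supplied callable so this sketch stays
    runnable without librbd; a real version would open each image
    and read its size via rbd_stat().
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = pool.map(stat_image, image_names)  # preserves input order
    return dict(zip(image_names, sizes))

# Demo with a fake stat function standing in for librbd calls.
sizes = refresh_pool(["img-a", "img-bb"], lambda name: len(name))
```

Since each stat is dominated by network round-trip latency rather than CPU, wall time for a 10k-image pool should drop roughly in proportion to the worker count, until the OSDs or the single connection become the bottleneck.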
Re: Speeding up rbd_stat() in libvirt
On 04-01-16 16:38, Jason Dillaman wrote: > Short term, assuming there wouldn't be an objection from the libvirt > community, I think spawning a thread pool and concurrently executing several > rbd_stat calls concurrently would be the easiest and cleanest solution. I > wouldn't suggest trying to roll your own solution for retrieving image sizes > for format 1 and 2 RBD images directly within libvirt. > I'll ask in the libvirt community if they allow such a thing. > Longer term, given this use case, perhaps it would make sense to add an async > version of rbd_open. The rbd_stat call itself just reads the data from > memory initialized by rbd_open. On the Jewel branch, librbd has had some > major rework and image loading is asynchronous under the hood already. > Hmm, that would be nice. In the callback I could call rbd_stat() and populate the volume list within libvirt. I would very much like to go that route since it saves me a lot of code inside libvirt ;) Wido -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Create one millon empty files with cephfs
On Tue, Dec 29, 2015 at 4:55 AM, Fengguang Gongwrote: > hi, > We create one million empty files through filebench, here is the test env: > MDS: one MDS > MON: one MON > OSD: two OSD, each with one Inter P3700; data on OSD with 2x replica > Network: all nodes are connected through 10 gigabit network > > We use more than one client to create files, to test the scalability of > MDS. Here are the results: > IOPS under one client: 850 > IOPS under two client: 1150 > IOPS under four client: 1180 > > As we can see, the IOPS almost maintains unchanged when the number of > client increase from 2 to 4. > > Cephfs may have a low scalability under one MDS, and we think its the big > lock in > MDSDamon::ms_dispatch()::Mutex::locker(every request acquires this lock), > who limits the > scalability of MDS. > > We think this big lock could be removed through the following steps: > 1. separate the process of ClientRequest with other requests, so we can > parallel the process > of ClientRequest > 2. use some small granularity locks instead of big lock to ensure > consistency > > Wondering this idea is reasonable? Parallelizing the MDS is probably a very big job; it's on our radar but not for a while yet. If one were to do it, yes, breaking down the big MDS lock would be the way forward. I'm not sure entirely what that involves — you'd need to significantly chunk up the locking on our more critical data structures, most especially the MDCache. Luckily there is *some* help there in terms of the file cap locking structures we already have in place, but it's a *huge* project and not one to be undertaken lightly. A special processing mechanism for ClientRequests versus other requests is not an assumption I'd start with. I think you'll find that file creates are just about the least scalable thing you can do on CephFS right now, though, so there is some easier ground. 
One obvious approach is to extend the current inode preallocation — it already allocates inodes per-client and has a fast path inside of the MDS for handing them back. It'd be great if clients were aware of that preallocation and could create files without waiting for the MDS to talk back to them! The issue with this is two-fold: 1) need to update the cap flushing protocol to deal with files newly created by the client 2) need to handle all the backtrace stuff normally performed by the MDS on file create (which still needs to happen, on either the client or the server) There's also clean up in case of a client failure, but we've already got a model for that in how we figure out real file sizes and things based on max size. I think there's a ticket about this somewhere, but I can't find it off-hand... -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
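Greg's preallocation idea can be sketched as follows (hypothetical protocol and names; the hard parts he lists, the cap-flush protocol changes, backtrace writes, and failure cleanup, are not modelled): the MDS grants the client a range of inode numbers up front, the client consumes them locally on create without a round trip, and the new inodes are flushed back to the MDS later.

```python
class Client:
    """Sketch of a CephFS client creating files from preallocated inos."""

    def __init__(self, prealloc_inos):
        # Range of inode numbers granted by the MDS at session open.
        self.free_inos = list(prealloc_inos)
        # Files created locally but not yet flushed to the MDS.
        self.dirty_creates = []

    def create(self, path):
        # No MDS round trip: consume a preallocated ino locally.
        if not self.free_inos:
            raise RuntimeError("must ask the MDS for more inos")
        ino = self.free_inos.pop(0)
        self.dirty_creates.append((ino, path))
        return ino

    def flush(self, mds_commit):
        # Later, push the batch of creates back to the MDS (in the real
        # design this rides the cap-flush protocol, and the MDS or client
        # must also write the backtrace for each new inode).
        for entry in self.dirty_creates:
            mds_commit(entry)
        self.dirty_creates.clear()

client = Client(range(1000, 1010))
```

The win is that a create-heavy workload issues one batched flush instead of one synchronous MDS request per file; the cost is exactly the two problems Greg names, plus cleanup of unflushed creates when a client dies.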
Re: 答复: Reboot blocked when undoing unmap op.
On Mon, Jan 4, 2016 at 10:51 AM, Wukongming wrote: > Hi, Ilya, > > It is an old problem. > When you say "when you issue a reboot, daemons get killed and the kernel > client ends up waiting for the them to come back, because of outstanding > writes issued by umount called by systemd (or whatever)." > > Do you mean if umount rbd successfully, the process of kernel client will > stop waiting? What kind of Communication mechanism between libceph and > daemons(or ceph userspace)? If you umount the filesystem on top of rbd and unmap the rbd image, there won't be anything to wait for. In fact, if there aren't any other rbd images mapped, libceph will clean up after itself and exit. If you umount the filesystem on top of rbd but don't unmap the image, libceph will remain there, along with some amount of communication (keepalive messages, watch requests, etc). However, all of that is internal and is unlikely to block reboot. If you don't umount the filesystem, your init system will try to umount it, issuing FS requests to the rbd device. We don't want to drop those requests, so, if daemons are gone by then, libceph ends up blocking. Thanks, Ilya
Re: How to configure if there are tow network cards in Client
it would certainly help those with less knowledge about networking in linux, though i do not know how many people using ceph are in this category. Sage and the others here may have a better idea about its feasibility. but i usually use rule-* and route-* (in CentOS) files, they work with networkmanager, and very easy to configure. in ubuntu you can put them in interfaces file, and they are as easy. if such a tool is made, i think it should understand the ceph.conf file, but i doubt it can figure out the routes correctly without you putting them in. On 12/29/2015 03:58 PM, 蔡毅 wrote: Thank for your replies. So is it reasonable that we could write a file such as shell script to bind one process with a specific IP and modify the routing tables and rules as one of Ceph's tools? So that the users is convenient when they want to change the NIC connecting with the OSD. At 2015-12-29 18:21:21, "Linux Chips" wrote: On 12/28/2015 07:47 PM, Sage Weil wrote: On Fri, 25 Dec 2015, ?? wrote: Hi all, When we read the code, we haven't found the function that lets the client bind a specific IP. In Ceph's configuration, we could only find the parameter 'public network', but it seems to act on the OSD and not the client. There is a scenario where the client has two network cards named NIC1 and NIC2. NIC1 is responsible for communicating with the cluster (monitor and RADOS) and NIC2 carries other services besides Ceph's client. So we need the client to bind a specific IP in order to differentiate the IP communicating with the cluster from the IP serving other applications. We want to know: is there any configuration in Ceph to achieve this? If there is, how could we configure the IP? If not, could we add this function to Ceph? Thank you so much. you can use routing tables plus routing rules. otherwise linux will just use the default gateway. or you can put the second interface on the same public net of ceph. though that would break if you have multiple external nets. Right.
There isn't a configurable to do this now--we've always just let the kernel network layer sort it out. Is this just a matter of calling bind on the socket before connecting? I've never done this before.. linux will send all packets to the default gateway even if an application binds to an IP on a different interface; the packet will go out with the source address as the bound one but through your router. the only solution, even if the bind function exists, is to use the routing tables and rules. sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
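On Sage's question: yes, the standard socket pattern is to bind a source address before connecting; the kernel then stamps outgoing packets with that source IP. As the thread notes, though, bind only chooses the source address. Which interface and gateway the packets leave through is still decided by the routing tables, so policy routing (rule-*/route-* files, or ip rule/ip route) is needed as well. A minimal sketch in Python (the same sequence applies to the C socket API):

```python
import socket

def connect_from(src_ip, dst_addr):
    """Open a TCP connection with an explicit source IP.

    bind() before connect() sets the source address of the flow;
    port 0 lets the kernel pick an ephemeral source port. Routing
    of the packets is still governed by the routing tables.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((src_ip, 0))
    s.connect(dst_addr)
    return s
```

For a daemon this would have to be wired in wherever the messenger creates its sockets, which is presumably why the thread leans toward solving it with routing rules instead.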
Re: Fwd: how io works when backfill
On Tue, 29 Dec 2015, Dong Wu wrote: > if add in osd.7 and 7 becomes the primary: pg1.0 [1, 2, 3] --> pg1.0 > [7, 2, 3], is it similar with the example above? > still install a pg_temp entry mapping the PG back to [1, 2, 3], then > backfill happens to 7, normal io write to [1, 2, 3], if io to the > portion of the PG that has already been backfilled will also be sent > to osd.7? Yes (although I forget how it picks the ordering of the osds in the temp mapping). See PG::choose_acting() for the details. > how about these examples about removing an osd: > - pg1.0 [1, 2, 3] > - osd.3 down and be removed > - mapping changes to [1, 2, 5], but osd.5 has no data, then install a > pg_temp mapping the PG back to [1, 2], then backfill happens to 5, > - normal io write to [1, 2], if io hits object which has been > backfilled to osd.5, io will also send to osd.5 > - when backfill completes, remove the pg_temp and mapping changes back > to [1, 2, 5] Yes > another example: > - pg1.0 [1, 2, 3] > - osd.3 down and be removed > - mapping changes to [5, 1, 2], but osd.5 has no data of the pg, then > install a pg_temp mapping the PG back to [1, 2] which osd.1 > temporarily becomes the primary, then backfill happens to 5, > - normal io write to [1, 2], if io hits object which has been > backfilled to osd.5, io will also send to osd.5 > - when backfill completes, remove the pg_temp and mapping changes back > to [5, 1, 2] > > is my ananysis right? Yep! sage > > 2015-12-29 1:30 GMT+08:00 Sage Weil: > > On Mon, 28 Dec 2015, Zhiqiang Wang wrote: > >> 2015-12-27 20:48 GMT+08:00 Dong Wu : > >> > Hi, > >> > When add osd or remove osd, ceph will backfill to rebalance data. > >> > eg: > >> > - pg1.0[1, 2, 3] > >> > - add an osd(eg. 
osd.7) > >> > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7] > >> > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now > >> > object a is backfilling > >> > - when a write io hits object a, then the io needs to wait for its > >> > complete, then goes on. > >> > - but if io hits object b which has not been backfilled, io reaches > >> > osd.1, then osd.1 send the io to osd.2 and osd.7, but osd.7 does not > >> > have object b, so osd.7 needs to wait for object b to backfilled, then > >> > write. Is it right? Or osd.1 only send the io to osd.2, not both? > >> > >> I think in this case, when the write of object b reaches osd.1, it > >> holds the client write, raises the priority of the recovery of object > >> b, and kick off the recovery of it. When the recovery of object b is > >> done, it requeue the client write, and then everything goes like > >> usual. > > > > It's more complicated than that. In a normal (log-based) recovery > > situation, it is something like the above: if the acting set is [1,2,3] > > but 3 is missing the latest copy of A, a write to A will block on the > > primary while the primary initiates recovery of A immediately. Once that > > completes the IO will continue. > > > > For backfill, it's different. In your example, you start with [1,2,3] > > then add in osd.7. The OSD will see that 7 has no data for teh PG and > > install a pg_temp entry mapping the PG back to [1,2,3] temporarily. Then > > things will proceed normally while backfill happens to 7. Backfill won't > > interfere with normal IO at all, except that IO to the portion of the PG > > that has already been backfilled will also be sent to the backfill target > > (7) so that it stays up to date. Once it complets, the pg_temp entry is > > removed and the mapping changes back to [1,2,7]. Then osd.3 is allowed to > > remove it's copy of the PG. 
> > > > sage
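The write-routing rule Sage describes (pg_temp during backfill, with already-backfilled objects also replicated to the backfill target) can be sketched as a toy model; this is an illustration only, not Ceph's actual data structures, which live in PG::choose_acting() and the OSD op path:

```python
def replicas_for_write(obj, acting, backfill_target, backfill_pos):
    """Return the OSDs a write to `obj` must be sent to.

    `acting` is the temporary (pg_temp) acting set, e.g. [1, 2, 3].
    `backfill_target` is the OSD being backfilled, e.g. 7.
    Objects are backfilled in order; everything up to `backfill_pos`
    has already been copied, so writes there must also go to the
    target to keep it up to date.  Writes past the position go only
    to the temporary acting set and are copied later by backfill.
    """
    targets = list(acting)
    if obj <= backfill_pos:
        targets.append(backfill_target)
    return targets

# pg1.0 maps to [1, 2, 7] after osd.7 is added; pg_temp maps it back
# to [1, 2, 3] while 7 backfills.  Backfill has reached object "c":
assert replicas_for_write("a", [1, 2, 3], 7, "c") == [1, 2, 3, 7]
assert replicas_for_write("e", [1, 2, 3], 7, "c") == [1, 2, 3]
```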
Create one million empty files with cephfs
hi, We create one million empty files through filebench. Here is the test env: MDS: one MDS MON: one MON OSD: two OSDs, each with one Intel P3700; data on OSDs with 2x replication Network: all nodes are connected through a 10 gigabit network We use more than one client to create files, to test the scalability of the MDS. Here are the results: IOPS under one client: 850 IOPS under two clients: 1150 IOPS under four clients: 1180 As we can see, the IOPS stays almost unchanged when the number of clients increases from 2 to 4. CephFS may have low scalability under one MDS, and we think it is the big lock in MDSDaemon::ms_dispatch()::Mutex::Locker (every request acquires this lock) that limits the scalability of the MDS. We think this big lock could be removed through the following steps: 1. separate the processing of ClientRequest from other requests, so we can parallelize ClientRequest processing 2. use smaller-granularity locks instead of the big lock to ensure consistency Is this idea reasonable? thanks
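The two proposed steps (give ClientRequest dispatch its own workers, then replace the one dispatch mutex with finer-grained locks) can be sketched as a toy dispatcher; this is an illustration of the locking idea only, not MDS code, and the shard-by-inode scheme is an assumption for the example:

```python
import threading
from queue import Queue

# Step 1: client requests get their own queue and worker pool instead
# of every message type funnelling through one dispatch lock.
client_q = Queue()

# Step 2: shard the metadata lock (here: by inode number) so parallel
# creates do not all serialize on a single mutex.
NSHARDS = 16
shard_locks = [threading.Lock() for _ in range(NSHARDS)]
created = set()

def handle_create(ino, name):
    with shard_locks[ino % NSHARDS]:
        created.add((ino, name))

def worker(q):
    while True:
        item = q.get()
        if item is None:       # sentinel: shut this worker down
            return
        handle_create(*item)

threads = [threading.Thread(target=worker, args=(client_q,)) for _ in range(4)]
for t in threads:
    t.start()
for i in range(1000):
    client_q.put((i, "file-%d" % i))
for t in threads:
    client_q.put(None)
for t in threads:
    t.join()
assert len(created) == 1000    # all creates applied, none lost to races
```

The hard part in the real MDS is of course step 2: the shared state (cache, locker, journal) is far more entangled than a set, which is why the thread asks whether the idea is reasonable rather than presenting a patch.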
Re: Re: How to configure if there are two network cards in Client
Thank for your replies. So is it reasonable that we could write a file such as shell script to bind one process with a specific IP and modify the routing tables and rules as one of Ceph’s tools? So that the users is convenient when they want to change the NIC connecting with the OSD. At 2015-12-29 18:21:21, "Linux Chips"wrote: >On 12/28/2015 07:47 PM, Sage Weil wrote: >> On Fri, 25 Dec 2015, ?? wrote: >>> Hi all, >>> When we read the code, we haven?t find the function that the client >>> can bind a specific IP. In Ceph?s configuration, we could only find the >>> parameter ?public network?, but it seems acts on the OSD but not the client. >>> There is a scenario that the client has two network cards named NIC1 >>> and NIC2. The NIC1 is responsible for communicating with cluster (monitor >>> and RADOS) and the NIC2 has other services except Ceph?s client. So we >>> need the client can bind specific IP in order to differentiate the IP >>> communicating with cluster from another IP serving other applications. We >>> want to know is there any configuration in Ceph to achieve this function? >>> If there is, how could we configure the IP? if not, could we add this >>> function in Ceph? Thank you so much. >you can use routing tables plus routing rules. otherwise linux will just >use the default gateway. >or you can put the second interface on the same public net of ceph. >though that would break if you have multiple external nets. >> Right. There isn't a configurable to do this now--we've always just let >> the kernel network layer sort it out. Is this just a matter of calling >> bind on the socket before connecting? I've never done this before.. >linux will send all packets to the default gateway event if an >application binds to an ip on different interface, the packet will go >out with the source address as the binded one but through your router. >the only solution, even if the bind function exists is to use the >routing tables and rules. 
>> >> sage >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >
Re: How to configure if there are two network cards in Client
On 12/28/2015 07:47 PM, Sage Weil wrote: On Fri, 25 Dec 2015, ?? wrote: Hi all, When we read the code, we haven?t find the function that the client can bind a specific IP. In Ceph?s configuration, we could only find the parameter ?public network?, but it seems acts on the OSD but not the client. There is a scenario that the client has two network cards named NIC1 and NIC2. The NIC1 is responsible for communicating with cluster (monitor and RADOS) and the NIC2 has other services except Ceph?s client. So we need the client can bind specific IP in order to differentiate the IP communicating with cluster from another IP serving other applications. We want to know is there any configuration in Ceph to achieve this function? If there is, how could we configure the IP? if not, could we add this function in Ceph? Thank you so much. you can use routing tables plus routing rules. otherwise linux will just use the default gateway. or you can put the second interface on the same public net of ceph. though that would break if you have multiple external nets. Right. There isn't a configurable to do this now--we've always just let the kernel network layer sort it out. Is this just a matter of calling bind on the socket before connecting? I've never done this before.. linux will send all packets to the default gateway event if an application binds to an ip on different interface, the packet will go out with the source address as the binded one but through your router. the only solution, even if the bind function exists is to use the routing tables and rules. sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Fwd: how io works when backfill
if add in osd.7 and 7 becomes the primary: pg1.0 [1, 2, 3] --> pg1.0 [7, 2, 3], is it similar with the example above? still install a pg_temp entry mapping the PG back to [1, 2, 3], then backfill happens to 7, normal io write to [1, 2, 3], if io to the portion of the PG that has already been backfilled will also be sent to osd.7? how about these examples about removing an osd: - pg1.0 [1, 2, 3] - osd.3 down and be removed - mapping changes to [1, 2, 5], but osd.5 has no data, then install a pg_temp mapping the PG back to [1, 2], then backfill happens to 5, - normal io write to [1, 2], if io hits object which has been backfilled to osd.5, io will also send to osd.5 - when backfill completes, remove the pg_temp and mapping changes back to [1, 2, 5] another example: - pg1.0 [1, 2, 3] - osd.3 down and be removed - mapping changes to [5, 1, 2], but osd.5 has no data of the pg, then install a pg_temp mapping the PG back to [1, 2] which osd.1 temporarily becomes the primary, then backfill happens to 5, - normal io write to [1, 2], if io hits object which has been backfilled to osd.5, io will also send to osd.5 - when backfill completes, remove the pg_temp and mapping changes back to [5, 1, 2] is my ananysis right? 2015-12-29 1:30 GMT+08:00 Sage Weil: > On Mon, 28 Dec 2015, Zhiqiang Wang wrote: >> 2015-12-27 20:48 GMT+08:00 Dong Wu : >> > Hi, >> > When add osd or remove osd, ceph will backfill to rebalance data. >> > eg: >> > - pg1.0[1, 2, 3] >> > - add an osd(eg. osd.7) >> > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7] >> > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now >> > object a is backfilling >> > - when a write io hits object a, then the io needs to wait for its >> > complete, then goes on. >> > - but if io hits object b which has not been backfilled, io reaches >> > osd.1, then osd.1 send the io to osd.2 and osd.7, but osd.7 does not >> > have object b, so osd.7 needs to wait for object b to backfilled, then >> > write. 
Is it right? Or osd.1 only send the io to osd.2, not both? >> >> I think in this case, when the write of object b reaches osd.1, it >> holds the client write, raises the priority of the recovery of object >> b, and kick off the recovery of it. When the recovery of object b is >> done, it requeue the client write, and then everything goes like >> usual. > > It's more complicated than that. In a normal (log-based) recovery > situation, it is something like the above: if the acting set is [1,2,3] > but 3 is missing the latest copy of A, a write to A will block on the > primary while the primary initiates recovery of A immediately. Once that > completes the IO will continue. > > For backfill, it's different. In your example, you start with [1,2,3] > then add in osd.7. The OSD will see that 7 has no data for teh PG and > install a pg_temp entry mapping the PG back to [1,2,3] temporarily. Then > things will proceed normally while backfill happens to 7. Backfill won't > interfere with normal IO at all, except that IO to the portion of the PG > that has already been backfilled will also be sent to the backfill target > (7) so that it stays up to date. Once it complets, the pg_temp entry is > removed and the mapping changes back to [1,2,7]. Then osd.3 is allowed to > remove it's copy of the PG. > > sage > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Fwd: how io works when backfill
2015-12-27 20:48 GMT+08:00 Dong Wu: > Hi, > When add osd or remove osd, ceph will backfill to rebalance data. > eg: > - pg1.0[1, 2, 3] > - add an osd(eg. osd.7) > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7] > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now > object a is backfilling > - when a write io hits object a, then the io needs to wait for its > complete, then goes on. > - but if io hits object b which has not been backfilled, io reaches > osd.1, then osd.1 send the io to osd.2 and osd.7, but osd.7 does not > have object b, so osd.7 needs to wait for object b to backfilled, then > write. Is it right? Or osd.1 only send the io to osd.2, not both? I think in this case, when the write of object b reaches osd.1, it holds the client write, raises the priority of the recovery of object b, and kick off the recovery of it. When the recovery of object b is done, it requeue the client write, and then everything goes like usual. > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Speeding up rbd_stat() in libvirt
Hi, The storage pools of libvirt know a mechanism called 'refresh' which will scan a storage pool to refresh its contents. The current implementation does: * List all images via rbd_list() * Call rbd_stat() on each image Source: http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=cdbfdee98505492407669130712046783223c3cf;hb=master#l329 This works, but an RBD pool with 10k images takes a couple of minutes to scan. Now, Ceph is distributed, so this could be done in parallel, but before I start on this I was wondering if somebody had a good idea for fixing it? I don't know if it is allowed in libvirt to spawn multiple threads and have workers do this, but it was something which came to mind. libvirt only wants to know the size of an image, and this is not stored in the rbd_directory object, so the rbd_stat() is required. Suggestions or ideas? I would like this process to be as fast as possible. Wido
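The worker-thread idea can be sketched as follows (shown in Python for brevity, although libvirt itself is C; `stat_image()` is a hypothetical stand-in for the per-image open + rbd_stat() round trip):

```python
from concurrent.futures import ThreadPoolExecutor

def stat_image(name):
    # Placeholder for opening an image and calling rbd_stat(); in a
    # real implementation this would return the size from librbd.
    return {"name": name, "size": 4 * 1024**3}

def refresh_pool(image_names, workers=16):
    # Each stat is mostly network wait, so a thread pool overlaps the
    # round trips; results come back in the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(stat_image, image_names))

stats = refresh_pool(["img-%d" % i for i in range(100)])
assert len(stats) == 100
```

With 16 workers the 10k-image scan would need roughly 10000/16 round-trip times instead of 10000, assuming the cluster, not the client, is the bottleneck.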
Re: FreeBSD Building and Testing
Hi, Can somebody try to help me and explain why in test: Func: test/mon/osd-crash Func: TEST_crush_reject_empty started Fails with a python error which sort of startles me: test/mon/osd-crush.sh:227: TEST_crush_reject_empty: local empty_map=testdir/osd-crush/empty_map test/mon/osd-crush.sh:228: TEST_crush_reject_empty: : test/mon/osd-crush.sh:229: TEST_crush_reject_empty: ./crushtool -c testdir/osd-crush/empty_map.txt -o testdir/osd-crush/empty_map.m ap test/mon/osd-crush.sh:230: TEST_crush_reject_empty: expect_failure testdir/osd-crush 'Error EINVAL' ./ceph osd setcrushmap -i testd ir/osd-crush/empty_map.map ../qa/workunits/ceph-helpers.sh:1171: expect_failure: local dir=testdir/osd-crush ../qa/workunits/ceph-helpers.sh:1172: expect_failure: shift ../qa/workunits/ceph-helpers.sh:1173: expect_failure: local 'expected=Error EINVAL' ../qa/workunits/ceph-helpers.sh:1174: expect_failure: shift ../qa/workunits/ceph-helpers.sh:1175: expect_failure: local success ../qa/workunits/ceph-helpers.sh:1176: expect_failure: pwd ../qa/workunits/ceph-helpers.sh:1177: expect_failure: printenv ../qa/workunits/ceph-helpers.sh:1178: expect_failure: echo ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map ../qa/workunits/ceph-helpers.sh:1180: expect_failure: ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map *** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH *** Traceback (most recent call last): File "./ceph", line 936, in retval = main() File "./ceph", line 874, in main sigdict, inbuf, verbose) File "./ceph", line 457, in new_style_command inbuf=inbuf) File "/usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/pybind/ceph_argparse.py", line 1208, in json_command raise RuntimeError('"{0}": exception {1}'.format(argdict, e)) RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd setcrushmap"}']": exception 'utf8' codec can't decode b yte 0x86 in position 56: invalid start byte Which is certainly not the type of error expected. 
But it is hard to detect any 0x86 in the arguments. And yes, Python is right: there are no UTF-8 sequences that start with 0x86. The question is: why does it want to parse this as UTF-8? And how do I switch that off? Or how do I fix this error? Thanx, --WjW
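The error itself is easy to reproduce outside Ceph: 0x86 is a UTF-8 continuation byte, so it can never begin a character, and Python's strict UTF-8 codec raises exactly the message seen in the traceback. A minimal reproduction (not the Ceph code path, where the byte is presumably hiding in the binary crushmap echoed back in the exception):

```python
payload = b"osd setcrushmap" + b"\x86"  # a stray non-UTF-8 byte

try:
    payload.decode("utf-8")
    raise AssertionError("expected UnicodeDecodeError")
except UnicodeDecodeError as e:
    assert e.reason == "invalid start byte"
    assert payload[e.start] == 0x86

# Two workarounds: decode permissively, or keep the buffer as bytes.
assert payload.decode("utf-8", errors="replace").endswith("\ufffd")
assert payload.decode("latin-1")[-1] == "\x86"
```

So the fix is not to "switch off" UTF-8 but to stop treating binary command output as text: keep it as bytes, or decode with a lossy error handler only for display.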
Re: Fwd: how io works when backfill
On Mon, 28 Dec 2015, Zhiqiang Wang wrote: > 2015-12-27 20:48 GMT+08:00 Dong Wu: > > Hi, > > When add osd or remove osd, ceph will backfill to rebalance data. > > eg: > > - pg1.0[1, 2, 3] > > - add an osd(eg. osd.7) > > - ceph start backfill, then pg1.0 osd set changes to [1, 2, 7] > > - if [a, b, c, d, e] are objects needing to backfill to osd.7 and now > > object a is backfilling > > - when a write io hits object a, then the io needs to wait for its > > complete, then goes on. > > - but if io hits object b which has not been backfilled, io reaches > > osd.1, then osd.1 send the io to osd.2 and osd.7, but osd.7 does not > > have object b, so osd.7 needs to wait for object b to backfilled, then > > write. Is it right? Or osd.1 only send the io to osd.2, not both? > > I think in this case, when the write of object b reaches osd.1, it > holds the client write, raises the priority of the recovery of object > b, and kicks off the recovery of it. When the recovery of object b is > done, it requeues the client write, and then everything goes like > usual. It's more complicated than that. In a normal (log-based) recovery situation, it is something like the above: if the acting set is [1,2,3] but 3 is missing the latest copy of A, a write to A will block on the primary while the primary initiates recovery of A immediately. Once that completes the IO will continue. For backfill, it's different. In your example, you start with [1,2,3] then add in osd.7. The OSD will see that 7 has no data for the PG and install a pg_temp entry mapping the PG back to [1,2,3] temporarily. Then things will proceed normally while backfill happens to 7. Backfill won't interfere with normal IO at all, except that IO to the portion of the PG that has already been backfilled will also be sent to the backfill target (7) so that it stays up to date. Once it completes, the pg_temp entry is removed and the mapping changes back to [1,2,7]. Then osd.3 is allowed to remove its copy of the PG.
sage
Re: How to configure if there are two network cards in Client
On Fri, 25 Dec 2015, Cai Yi wrote: > Hi all, > When we read the code, we haven't found a function that lets the client > bind a specific IP. In Ceph's configuration, we could only find the parameter > "public network", but it seems to act on the OSD and not the client. > There is a scenario where the client has two network cards, NIC1 and > NIC2. NIC1 is responsible for communicating with the cluster (monitor and > RADOS) and NIC2 carries other services besides Ceph's client. So we need the > client to be able to bind a specific IP in order to separate the IP communicating > with the cluster from the IP serving other applications. We want to know: is > there any configuration in Ceph to achieve this? If there is, how > do we configure the IP? If not, could we add this function to Ceph? Thank > you so much. Right. There isn't a configurable to do this now--we've always just let the kernel network layer sort it out. Is this just a matter of calling bind on the socket before connecting? I've never done this before.. sage
ceph branch status
-- All Branches -- Abhishek Varshney2015-11-23 11:45:29 +0530 infernalis-backports Adam C. Emerson 2015-12-21 16:51:39 -0500 wip-cxx11concurrency Adam Crume 2014-12-01 20:45:58 -0800 wip-doc-rbd-replay Alfredo Deza 2015-03-23 16:39:48 -0400 wip-11212 2015-12-23 11:25:13 -0500 wip-doc-style Alfredo Deza 2014-07-08 13:58:35 -0400 wip-8679 2014-09-04 13:58:14 -0400 wip-8366 2014-10-13 11:10:10 -0400 wip-9730 Ali Maredia 2015-11-25 13:45:29 -0500 wip-10587-split-servers 2015-12-23 12:01:46 -0500 wip-cmake 2015-12-23 16:12:47 -0500 wip-cmake-rocksdb Barbora AnÄincová 2015-11-04 16:43:45 +0100 wip-doc-RGW Boris Ranto 2015-09-04 15:19:11 +0200 wip-bash-completion Daniel Gryniewicz 2015-11-11 09:06:00 -0500 wip-rgw-storage-class 2015-12-09 12:56:37 -0500 cmake-dang Danny Al-Gaaf 2015-04-23 16:32:00 +0200 wip-da-SCA-20150421 2015-04-23 17:18:57 +0200 wip-nosetests 2015-04-23 18:20:16 +0200 wip-unify-num_objects_degraded 2015-11-03 14:10:47 +0100 wip-da-SCA-20151029 2015-11-03 14:40:44 +0100 wip-da-SCA-20150910 David Zafman 2014-08-29 10:41:23 -0700 wip-libcommon-rebase 2015-04-24 13:14:23 -0700 wip-cot-giant 2015-09-28 11:33:11 -0700 wip-12983 2015-12-22 16:19:25 -0800 wip-zafman-testing Dongmao Zhang 2014-11-14 19:14:34 +0800 thesues-master Greg Farnum 2015-04-29 21:44:11 -0700 wip-init-names 2015-07-16 09:28:24 -0700 hammer-12297 2015-10-02 13:00:59 -0700 greg-infernalis-lock-testing 2015-10-02 13:09:05 -0700 greg-infernalis-lock-testing-cacher 2015-10-07 00:45:24 -0700 greg-infernalis-fs 2015-10-21 17:43:07 -0700 client-pagecache-norevoke 2015-10-27 11:32:46 -0700 hammer-pg-replay 2015-11-24 07:17:33 -0800 greg-fs-verify 2015-12-11 00:24:40 -0800 greg-fs-testing Greg Farnum 2014-10-23 13:33:44 -0700 wip-forward-scrub Guang G Yang 2015-06-26 20:31:44 + wip-ec-readall 2015-07-23 16:13:19 + wip-12316 Guang Yang 2014-09-25 00:47:46 + wip-9008 2015-10-20 15:30:41 + wip-13441 Haomai Wang 2015-10-26 00:02:04 +0800 wip-13521 Haomai Wang 2014-07-27 13:37:49 +0800 wip-flush-set 
2015-04-20 00:47:59 +0800 update-organization 2015-07-21 19:33:56 +0800 fio-objectstore 2015-08-26 09:57:27 +0800 wip-recovery-attr 2015-10-24 23:39:07 +0800 fix-compile-warning Hector Martin 2015-12-03 03:07:02 +0900 wip-cython-rbd Ilya Dryomov 2014-09-05 16:15:10 +0400 wip-rbd-notify-errors Ivo Jimenez 2015-08-24 23:12:45 -0700 hammer-with-new-workunit-for-wip-12551 James Page 2015-11-04 11:08:42 + javacruft-wip-ec-modules Jason Dillaman 2015-08-31 23:17:53 -0400 wip-12698 2015-11-13 02:00:21 -0500 wip-11287-rebased Jenkins 2015-11-04 14:31:13 -0800 rhcs-v0.94.3-ubuntu Jenkins 2014-07-29 05:24:39 -0700 wip-nhm-hang 2014-10-14 12:10:38 -0700 wip-2 2015-02-02 10:35:28 -0800 wip-sam-v0.92 2015-08-21 12:46:32 -0700 last 2015-08-21 12:46:32 -0700 loic-v9.0.3 2015-09-15 10:23:18 -0700 rhcs-v0.80.8 2015-09-21 16:48:32 -0700 rhcs-v0.94.1-ubuntu Joao Eduardo Luis 2014-09-10 09:39:23 +0100 wip-leveldb-get.dumpling Joao Eduardo Luis 2014-07-22 15:41:42 +0100 wip-leveldb-misc Joao Eduardo Luis 2014-09-02 17:19:52 +0100 wip-leveldb-get 2014-10-17 16:20:11 +0100 wip-paxos-fix 2014-10-21 21:32:46 +0100 wip-9675.dumpling 2015-07-27 21:56:42 +0100 wip-11470.hammer 2015-09-09 15:45:45 +0100 wip-11786.hammer Joao Eduardo Luis 2014-11-17 16:43:53 + wip-mon-osdmap-cleanup 2014-12-15 16:18:56 + wip-giant-mon-backports 2014-12-17 17:13:57 + wip-mon-backports.firefly 2014-12-17 23:15:10 + wip-mon-sync-fix.dumpling 2015-01-07 23:01:00 + wip-mon-blackhole-mlog-0.87.7 2015-01-10 02:40:42 + wip-dho-joao 2015-01-10 02:46:31 +
Re: CEPH build
Hi, resending my letter. Thank you for the attention. Best regards, Vladislav Odintsov From: Sage WeilSent: Monday, December 28, 2015 19:49 To: Odintsov Vladislav Subject: Re: CEPH build Can you resend this to ceph-devel, and copy ad...@redhat.com? On Fri, 25 Dec 2015, Odintsov Vladislav wrote: > > Hi, Sage! > > > I'm working at Cloud provider as a system engineer, and now > I'm trying to build different versions of CEPH (0.94, 9.2, 10.0) with libxio > enabled, and I've got a problem with understanding, how do ceph maintainers > create official tarballs and builds from git repo. > > I saw you as a maintainer of build related files in a repo, and thought you > can help me :) If I'm wrong, please, say me, who can do it. > > I've found very many information sources with different description of ceph > build process: > > - https://github.com/ceph/ceph-build > > - https://github.com/ceph/autobuild-ceph > > - documentation on ceph.docs. > > > But I'm unable to get the same tarball as > at http://download.ceph.com/tarballs/ > > for example for version v0.94.5. What else should I read? Or, maybe there is > some magic...) > > > Actually, I want understand how official builds are made (which tools), I'd > like to go through all build related steps by myself to understand the > upstream building process. > > > Thanks a lot for your help! > > > > Best regards, > > Vladislav Odintsov -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
how io works when backfill
Hi, When an osd is added or removed, ceph will backfill to rebalance data. eg: - pg1.0 [1, 2, 3] - add an osd (eg. osd.7) - ceph starts backfill, then the pg1.0 osd set changes to [1, 2, 7] - if [a, b, c, d, e] are objects needing to be backfilled to osd.7, and object a is now backfilling - when a write io hits object a, the io needs to wait for it to complete, then goes on. - but if io hits object b, which has not been backfilled, the io reaches osd.1; osd.1 then sends the io to osd.2 and osd.7, but osd.7 does not have object b, so osd.7 needs to wait for object b to be backfilled before writing. Is that right? Or does osd.1 only send the io to osd.2, not both?
Re: [ceph-users] why not add (offset,len) to pglog
Thank you for your reply. I am looking forward to Sage's opinion too @sage. Also I'll keep up with the BlueStore and Kstore progress. Regards 2015-12-25 14:48 GMT+08:00 Ning Yao: > Hi, Dong Wu, > > 1. As I am currently working on other things, this proposal has been abandoned for > a long time. > 2. This is a complicated task, as we need to consider a lot of cases > (not just writeOp, but also truncate and delete) and also need to > consider the different effects on different backends (Replicated, EC). > 3. I don't think it is a good time to redo this patch now, since > BlueStore and Kstore are in progress, and I'm afraid of bringing in some > side-effects. We may prepare and propose the whole design at the next CDS. > 4. Currently, we already have some tricks to deal with recovery (like > throttling the max recovery ops, setting the priority for recovery, and so > on). So this kind of patch may not solve a critical problem, but just > make things better, and I am not quite sure that it will really > bring a big improvement. Based on my previous tests, it works > excellently on slow disks (say hdd), and also for short-time > maintenance. Otherwise, it will trigger the backfill process. So wait > for Sage's opinion @sage > > If you are interested in this, we may cooperate to do it. > > Regards > Ning Yao > > > 2015-12-25 14:23 GMT+08:00 Dong Wu : >> Thanks, from this pull request I learned that this issue is not >> completed; is there any new progress on this issue? >> >> 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) : >>> Yeah, this is a good idea for recovery, but not for backfill. >>> @YaoNing made a pull request about this >>> https://github.com/ceph/ceph/pull/3837 this year. >>> >>> 2015-12-25 11:16 GMT+08:00 Dong Wu : Hi, I have a doubt about the pglog: the pglog contains (op, object, version) etc. When peering, the pglog is used to construct the missing list, and then the whole object in the missing list is recovered, even if the data that differs among replicas is less than a whole object (eg, 4MB).
why not add (offset, len) to the pglog? If so, the missing list could contain (object, offset, len), and then we could reduce the recovered data. ___ ceph-users mailing list ceph-us...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >>> >>> -- >>> Regards, >>> Xinze Chi
How to configure if there are two network cards in Client
Hi all, When we read the code, we haven't found a function that lets the client bind a specific IP. In Ceph's configuration, we could only find the parameter "public network", but it seems to act on the OSD and not the client. There is a scenario where the client has two network cards, NIC1 and NIC2. NIC1 is responsible for communicating with the cluster (monitor and RADOS) and NIC2 carries other services besides Ceph's client. So we need the client to be able to bind a specific IP in order to separate the IP communicating with the cluster from the IP serving other applications. We want to know: is there any configuration in Ceph to achieve this? If there is, how do we configure the IP? If not, could we add this function to Ceph? Thank you so much. Best regards, Cai Yi
Re: [ceph-users] why not add (offset,len) to pglog
On Fri, 25 Dec 2015, Ning Yao wrote: > Hi, Dong Wu, > > 1. As I currently work for other things, this proposal is abandon for > a long time > 2. This is a complicated task as we need to consider a lots such as > (not just for writeOp, as well as truncate, delete) and also need to > consider the different affects for different backends(Replicated, EC). > 3. I don't think it is good time to redo this patch now, since the > BlueStore and Kstore is inprogress, and I'm afraid to bring some > side-effect. We may prepare and propose the whole design in next CDS. > 4. Currently, we already have some tricks to deal with recovery (like > throttle the max recovery op, set the priority for recovery and so > on). So this kind of patch may not solve the critical problem but just > make things better, and I am not quite sure that this will really > bring a big improvement. Based on my previous test, it works > excellently on slow disk (say hdd), and also for a short-time > maintaining. Otherwise, it will trigger the backfill process. So wait > for Sage's opinion @sage > > If you are interest on this, we may cooperate to do this. I think it's a great idea. We didn't do it before only because it is complicated. The good news is that if we can't conclusively infer exactly which parts of the object need to be recovered from the log entry we can always just fall back to recovering the whole thing. Also, the place where this is currently most visible is RBD small writes: - osd goes down - client sends a 4k overwrite and modifies an object - osd comes back up - client sends another 4k overwrite - client io blocks while osd recovers 4MB So even if we initially ignore truncate and omap and EC and clones and anything else complicated I suspect we'll get a nice benefit. 
I haven't thought about this too much, but my guess is that the hard part is making the primary's missing set representation include a partial delta (say, an interval_set<> indicating which ranges of the file have changed) in a way that gracefully degrades to recovering the whole object if we're not sure. In any case, we should definitely have the design conversation! sage > > Regards > Ning Yao > > > 2015-12-25 14:23 GMT+08:00 Dong Wu: > > Thanks, from this pull request I learned that this issue is not > > completed, is there any new progress of this issue? > > > > 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) : > >> Yeah, This is good idea for recovery, but not for backfill. > >> @YaoNing have pull a request about this > >> https://github.com/ceph/ceph/pull/3837 this year. > >> > >> 2015-12-25 11:16 GMT+08:00 Dong Wu : > >>> Hi, > >>> I have doubt about pglog, the pglog contains (op,object,version) etc. > >>> when peering, use pglog to construct missing list,then recover the > >>> whole object in missing list even if different data among replicas is > >>> less then a whole object data(eg,4MB). > >>> why not add (offset,len) to pglog? If so, the missing list can contain > >>> (object, offset, len), then we can reduce recover data. > >>> ___ > >>> ceph-users mailing list > >>> ceph-us...@lists.ceph.com > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >> > >> > >> > >> -- > >> Regards, > >> Xinze Chi > > ___ > > ceph-users mailing list > > ceph-us...@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > >
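Sage's sketch above (a per-object partial delta that degrades to whole-object recovery when unsure) can be illustrated with a toy model. All names here are invented for illustration; a real implementation would live in the OSD's missing-set code and use Ceph's interval_set<>. This simplified version just appends ranges without coalescing overlaps:

```python
# Toy model (invented names) of the "partial delta" idea: fold pglog entries
# into a per-object set of changed byte ranges, and punt to whole-object
# recovery for any op we can't express as ranges.

OBJECT_SIZE = 4 * 1024 * 1024  # default 4 MB RADOS object
WHOLE = None                   # sentinel: delta unknown, recover everything

def merge(delta, entry):
    """Fold one log entry into the recovery delta for an object."""
    if delta is WHOLE:
        return WHOLE
    op, off, length = entry
    if op == "write":
        # A plain overwrite: record its byte range (no coalescing here,
        # unlike a real interval_set<>).
        delta.append((off, off + length))
        return delta
    # truncate, omap, clones, EC overwrites, ...: gracefully degrade
    return WHOLE

log = [("write", 0, 4096), ("write", 4096, 4096), ("truncate", 8192, 0)]
delta = []
for e in log:
    delta = merge(delta, e)

# The truncate forces whole-object recovery; writes alone would cost 8 KB.
to_recover = OBJECT_SIZE if delta is WHOLE else sum(b - a for a, b in delta)
print(to_recover)
```

The key property is the one Sage names: correctness never depends on the delta being precise, because any doubt collapses to "recover the whole object".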
Re: [ceph-users] why not add (offset,len) to pglog
Hi, Dong Wu, 1. As I currently work on other things, this proposal has been abandoned for a long time. 2. This is a complicated task, as we need to consider a lot (not just writeOp, but also truncate and delete) and also the different effects on the different backends (replicated, EC). 3. I don't think it is a good time to redo this patch now, since BlueStore and Kstore are in progress, and I'm afraid of bringing in side-effects. We may prepare and propose the whole design at the next CDS. 4. Currently, we already have some tricks to deal with recovery (like throttling the max recovery ops, setting the priority for recovery and so on). So this kind of patch may not solve the critical problem but just make things better, and I am not quite sure that it will really bring a big improvement. Based on my previous test, it works excellently on slow disks (say HDD), and also for short-time maintenance. Otherwise, it will trigger the backfill process. So wait for Sage's opinion @sage. If you are interested in this, we may cooperate to do it. Regards Ning Yao 2015-12-25 14:23 GMT+08:00 Dong Wu: > Thanks, from this pull request I learned that this issue is not > completed, is there any new progress of this issue? > > 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽) : >> Yeah, This is good idea for recovery, but not for backfill. >> @YaoNing have pull a request about this >> https://github.com/ceph/ceph/pull/3837 this year. >> >> 2015-12-25 11:16 GMT+08:00 Dong Wu : >>> Hi, >>> I have doubt about pglog, the pglog contains (op,object,version) etc. >>> when peering, use pglog to construct missing list,then recover the >>> whole object in missing list even if different data among replicas is >>> less then a whole object data(eg,4MB). >>> why not add (offset,len) to pglog? If so, the missing list can contain >>> (object, offset, len), then we can reduce recover data. 
>>> ___ >>> ceph-users mailing list >>> ceph-us...@lists.ceph.com >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> >> >> >> -- >> Regards, >> Xinze Chi > ___ > ceph-users mailing list > ceph-us...@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] why not add (offset,len) to pglog
Thanks, from this pull request I learned that this issue is not completed, is there any new progress of this issue? 2015-12-25 12:30 GMT+08:00 Xinze Chi (信泽): > Yeah, This is good idea for recovery, but not for backfill. > @YaoNing have pull a request about this > https://github.com/ceph/ceph/pull/3837 this year. > > 2015-12-25 11:16 GMT+08:00 Dong Wu : >> Hi, >> I have doubt about pglog, the pglog contains (op,object,version) etc. >> when peering, use pglog to construct missing list,then recover the >> whole object in missing list even if different data among replicas is >> less then a whole object data(eg,4MB). >> why not add (offset,len) to pglog? If so, the missing list can contain >> (object, offset, len), then we can reduce recover data. >> ___ >> ceph-users mailing list >> ceph-us...@lists.ceph.com >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > -- > Regards, > Xinze Chi
why not add (offset,len) to pglog
Hi, I have a doubt about pglog: the pglog contains (op,object,version) etc. When peering, the pglog is used to construct the missing list, and then the whole object in the missing list is recovered, even if the data that differs among replicas is less than a whole object (e.g., 4MB). Why not add (offset,len) to the pglog? If so, the missing list could contain (object, offset, len), and then we could reduce the recovered data.
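The saving the question describes can be made concrete with a small back-of-envelope sketch. The log entries and names below are invented for illustration; today's pglog carries (op, object, version), and the proposal adds (offset, len):

```python
# Hypothetical sketch: bytes recovered when the missing list tracks whole
# objects vs. (object, offset, len) extents.

OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB RADOS object, as in the example

# Writes that happened while a replica was down; the last two fields are
# the proposed (offset, len) additions.
pglog = [
    ("write", "obj1", 101, 0,     4096),
    ("write", "obj1", 102, 8192,  4096),
    ("write", "obj2", 103, 65536, 4096),
]

# Today: every distinct missing object is copied in full.
whole = len({e[1] for e in pglog}) * OBJECT_SIZE

# Proposed: only the logged extents are copied.
extents = {}
for _, obj, _, off, length in pglog:
    extents.setdefault(obj, []).append((off, length))
partial = sum(length for ranges in extents.values() for _, length in ranges)

print(whole, partial)  # 8 MB of whole-object recovery vs. 12 KB of extents
```

Three 4 KB overwrites touch two objects, so whole-object recovery moves 8 MB where extent-based recovery would move 12 KB.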
Re: [ceph-users] why not add (offset,len) to pglog
Yeah, this is a good idea for recovery, but not for backfill. @YaoNing opened a pull request about this, https://github.com/ceph/ceph/pull/3837, this year. 2015-12-25 11:16 GMT+08:00 Dong Wu: > Hi, > I have doubt about pglog, the pglog contains (op,object,version) etc. > when peering, use pglog to construct missing list,then recover the > whole object in missing list even if different data among replicas is > less then a whole object data(eg,4MB). > why not add (offset,len) to pglog? If so, the missing list can contain > (object, offset, len), then we can reduce recover data. > ___ > ceph-users mailing list > ceph-us...@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Regards, Xinze Chi
Re: fixing jenkins builds on pull requests
Hi, I triaged the jenkins related failures (from #24 to #49): CentOS 6 not supported: https://jenkins.ceph.com/job/ceph-pull-requests/26/console https://jenkins.ceph.com/job/ceph-pull-requests/28/console https://jenkins.ceph.com/job/ceph-pull-requests/29/console https://jenkins.ceph.com/job/ceph-pull-requests/34/console https://jenkins.ceph.com/job/ceph-pull-requests/38/console https://jenkins.ceph.com/job/ceph-pull-requests/44/console https://jenkins.ceph.com/job/ceph-pull-requests/46/console https://jenkins.ceph.com/job/ceph-pull-requests/48/console https://jenkins.ceph.com/job/ceph-pull-requests/49/console Ubuntu 12.04 not supported: https://jenkins.ceph.com/job/ceph-pull-requests/27/console https://jenkins.ceph.com/job/ceph-pull-requests/36/console Failure to fetch from github https://jenkins.ceph.com/job/ceph-pull-requests/35/console I've not been able to analyze more failures because it looks like only 30 jobs are kept. Here is an updated summary: * running on unsupported operating systems (CentOS 6, precise and maybe others) * leftovers from a previous test (which should be removed when a new slave is provisioned for each test) * keep the last 300 jobs for forensic analysis (about one week worth) * disable reporting to github pull requests until the above are resolved (all failures were false negatives). Cheers On 23/12/2015 10:11, Loic Dachary wrote: > Hi Alfredo, > > I forgot to mention that the ./run-make-check.sh run currently has no known > false negative on CentOS 7. By that I mean that if run on master 100 times, > it will succeed 100 times. This is good to debug the jenkins builds on pull > requests as we know all problems either come from the infrastructure or the > pull request. We do not have to worry about random errors due to race > conditions in the tests or things like that. > > I'll keep an eye on the test results and analyse each failure. 
> For now it > would be best to disable reporting failures as they are almost entirely false > negative and will confuse the contributor. The failures come from: > > * running on unsupported operating systems (CentOS 6 and maybe others) > * leftovers from a previous test (which should be removed when a new slave > is provisioned for each test) > > I'll add to this thread when / if I find more. > > Cheers > -- Loïc Dachary, Artisan Logiciel Libre
use object size of 32k rather than 4M
Hi cephers, Sage and Haomai, Recently we hit a performance drop during recovery. The scenario is simple: 1. run fio with random writes (bs=4k) 2. stop one OSD; sleep 10; start the OSD 3. the IOPS drop from 6K to about 200 We now know the SSD that OSD is on is the bottleneck during recovery. After reading the code, we found the IO on that SSD comes from two sources: 1. normal recovery IO 2. user IO on objects in the missing list, which need the 4M object recovered first. So our first step was to limit the recovery IO to lower the stress on that SSD. That helps in some scenarios, but not this one. We have 36 OSDs with 3 replicas, so when one OSD goes down, about 1/12 of the objects will be in a degraded state. When we run fio with 4k random writes, about 1/12 of the IOs will stall and need the 4M object recovered first. That really amplifies the stress on that SSD. In order to reduce this amplification, we want to change the default object size from 4M to 32k. We know that will increase the number of objects per OSD and make the removal process longer. So here we want to ask you: are there any other potential problems a 32k object size would have? If there is no obvious problem, we could dive into it and do more testing. Many thanks! -- hzwulibin 2015-12-23
Re: [ceph-users] use object size of 32k rather than 4M
Hi, Robert Thanks for your quick reply. Yeah, the number of files really will be the potential problem. But if it is just a memory problem, we could use more memory in our OSD servers. Also, I tested it on XFS using mdtest; here is the result:

$ sudo ~/wulb/bin/mdtest -I 1 -z 1 -b 1024 -R -F
-- [[10342,1],0]: A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces: Module: OpenFabrics (openib) Host: 10-180-0-34 Another transport will be used instead, although this may result in lower performance. --
-- started at 12/23/2015 18:59:16 --
mdtest-1.8.3 was launched with 1 total task(s) on 1 nodes
Command line used: /home/ceph/wulb/bin/mdtest -I 1 -z 1 -b 1024 -R -F
Path: /home/ceph
FS: 824.5 GiB   Used FS: 4.8%   Inodes: 52.4 Mi   Used Inodes: 0.6%
random seed: 1450868356
1 tasks, 1025 files

SUMMARY: (of 1 iterations)
   Operation         Max         Min        Mean    Std Dev
   ---------         ---         ---        ----    -------
   File creation :  44660.505   44660.505   44660.505   0.000
   File stat     : 693747.783  693747.783  693747.783   0.000
   File read     : 365319.444  365319.444  365319.444   0.000
   File removal  :  62064.560   62064.560   62064.560   0.000
   Tree creation :  69680.729   69680.729   69680.729   0.000
   Tree removal  :    352.905     352.905     352.905   0.000

From what I tested, the speed of File stat and File read did not slow down much. So could I say that the speed of ops like looking up a file will not decrease much, and only the number of files increases? -- hzwulibin 2015-12-23 From: "Van Leeuwen, Robert" Date: 2015-12-23 20:57 To: hzwulibin, ceph-devel, ceph-users Cc: Subject: Re: [ceph-users] use object size of 32k rather than 4M >In order to reduce the enlarge impact, we want to change the default size of >the object from 4M to 32k. > >We know that will increase the number of the objects of one OSD and make >remove process become longer. > >Hmm, here i want to ask your guys is there any other potential problems will >32k size have? If no obvious problem, will could dive into >it and do more test on it. 
I assume the objects on the OSDs filesystem will become 32k when you do this. So if you have 1TB of data on one OSD you will have 31 million files == 31 million inodes This is excluding the directory structure which also might be significant. If you have 10 OSDs on a server you will easily hit 310 million inodes. You will need a LOT of memory to make sure the inodes are cached but even then looking up the inode might add significant latency. My guess is it will be fast in the beginning but it will grind to a halt when the cluster gets fuller due to inodes no longer being in memory. Also this does not take in any other bottlenecks you might hit in ceph which other users can probably answer better. Cheers, Robert van Leeuwen
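Robert's numbers check out as a back-of-envelope calculation (using decimal TB, as the round "31 million" suggests):

```python
# Sanity check of the inode counts: 1 TB of data per OSD at 32 KB per
# object vs. the default 4 MB object size.

TB = 10**12
objects_32k = TB // (32 * 1024)        # ~31 million files/inodes per OSD
objects_4m = TB // (4 * 1024 * 1024)   # ~240 thousand with the default

per_server = 10 * objects_32k          # 10 OSDs per server -> ~310 million
print(objects_32k, objects_4m, per_server)
```

So the 32 KB object size multiplies the per-OSD inode count by 128, which is where the caching and lookup-latency concerns come from.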
Re: Time to move the make check bot to jenkins.ceph.com
This is really great. Thanks Loic and Alfredo! - Ken On Tue, Dec 22, 2015 at 11:23 AM, Loic Dachary wrote: > Hi, > > The make check bot moved to jenkins.ceph.com today and ran its first > successful job. You will no longer see comments from the bot: it will update > the github status instead, which is less intrusive. > > Cheers > > On 21/12/2015 11:13, Loic Dachary wrote: >> Hi, >> >> The make check bot is broken in a way that I can't figure out right now. >> Maybe now is the time to move it to jenkins.ceph.com ? It should not be more >> difficult than launching the run-make-check.sh script. It does not need >> network or root access. >> >> Cheers >> > > -- > Loïc Dachary, Artisan Logiciel Libre >
Re: [ceph-users] use object size of 32k rather than 4M
>In order to reduce the enlarge impact, we want to change the default size of >the object from 4M to 32k. > >We know that will increase the number of the objects of one OSD and make >remove process become longer. > >Hmm, here i want to ask your guys is there any other potential problems will >32k size have? If no obvious problem, will could dive into >it and do more test on it. I assume the objects on the OSDs filesystem will become 32k when you do this. So if you have 1TB of data on one OSD you will have 31 million files == 31 million inodes This is excluding the directory structure which also might be significant. If you have 10 OSDs on a server you will easily hit 310 million inodes. You will need a LOT of memory to make sure the inodes are cached but even then looking up the inode might add significant latency. My guess is it will be fast in the beginning but it will grind to a halt when the cluster gets fuller due to inodes no longer being in memory. Also this does not take in any other bottlenecks you might hit in ceph which other users can probably answer better. Cheers, Robert van Leeuwen
Re: [ceph-users] use object size of 32k rather than 4M
>Thanks for your quick reply. Yeah, the number of file really will be the >potential problem. But if just the memory problem, we could use more memory in >our OSD >servers. Adding more memory might not be a viable solution: Ceph does not say how much data is stored in an inode but the docs say the xattr of ext4 is not big enough. Assuming xfs will use 512 bytes is probably very optimistic. So for e.g. 300 million inodes you are talking about, at least, 150GB. > >Also, i tested it on XFS use mdtest, here is the result: > > >FS: 824.5 GiB Used FS: 4.8% Inodes: 52.4 Mi Used Inodes: 0.6% 52 million files without extended attributes is probably not a real life scenario for a filled up ceph node with multiple OSDs. Cheers, Robert van Leeuwen
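The "at least 150GB" figure follows directly from the stated assumptions:

```python
# Rough check of the inode-cache estimate: 300 million cached inodes at an
# (optimistic, per the email) 512 bytes each.

inodes = 300 * 10**6
bytes_per_inode = 512            # xattr spill can make the real cost larger
ram = inodes * bytes_per_inode   # ~150 GB just to keep inodes in memory
print(ram / 10**9)
```

That is RAM spent purely on keeping inodes cached, before any of Ceph's own memory use.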
Re: Let's Not Destroy the World in 2038
On 22/12/2015, Gregory Farnum wrote: [snip] > So I think we're stuck with creating a new utime_t and incrementing > the struct_v on everything that contains them. :/ [snip] > We'll also then need the full feature bit system to make > sure we send the old encoding to clients which don't understand the > new one, and to prevent a mid-upgrade cluster from writing data on a > new node that gets moved to a new node which doesn't understand it. That is my understanding. I have the impression that network communication gets feature bits for the other nodes and on-disk structures are explicitly versioned. If I'm mistaken, please hurl corrections at me. > Given that utime_t occurs in a lot of places, and really can't change > *again* after this, we probably shouldn't set up the new version with > versioned encoding? You're overly pessimistic. I'm hoping our post-human descendants store their unfathomably alien, reconstructed minds in some galaxy-spanning descendant of Ceph and need more than a 64-bit second count. However, I agree that the time value itself should not have an encoded version tag. To my intuition, the best way forward would be to: (1) Add non-defaulted feature parameters on encode/decode of utime_t and ceph::real_time. This will break everything that uses them. (2) Add explicit encode_old/encode_new functions. That way, when we KNOW which one we want at compile time, we don't have to pay for a runtime check. (3) When we have feature bits, pass them in. (4) When we have a version, bump it. For new versions, explicitly call encode_new. When we know we want old, call old. (5) If there are classes that we encode that have neither feature bits nor versioning available, see what uses them and act accordingly. Hopefully the special cases will be few. Does that seem reasonable? I thank you. And all hypothetical post-human Ceph users thank you. 
-- Senior Software Engineer Red Hat Storage, Ann Arbor, MI, US IRC: Aemerson@{RedHat, OFTC, Freenode} 0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C 7C12 80F7 544B 90ED BFB9
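Steps (1)-(4) above can be pictured with a small sketch. The feature bit, flag value, and wire layouts below are invented for illustration and are not Ceph's actual encoding; the point is the dispatch pattern: explicit old/new encoders, plus a feature-driven chooser for when the peer's capabilities are only known at runtime.

```python
# Illustrative only: dispatching between a 32-bit-seconds legacy encoding
# and a 64-bit-seconds replacement based on a (hypothetical) feature bit.
import struct

FEATURE_WIDE_TIME = 1 << 57  # hypothetical feature bit, not a real Ceph one

def encode_old(sec, nsec):
    # Legacy layout: 32-bit seconds silently truncate large values.
    return struct.pack("<II", sec & 0xFFFFFFFF, nsec)

def encode_new(sec, nsec):
    # Proposed layout: 64-bit seconds + 32-bit nanoseconds.
    return struct.pack("<QI", sec, nsec)

def encode(sec, nsec, features):
    # Step (3): when the peer's features arrive at runtime, dispatch on them.
    if features & FEATURE_WIDE_TIME:
        return encode_new(sec, nsec)
    return encode_old(sec, nsec)

sec = 2**32 + 5  # a second count that no longer fits in 32 bits
old_sec = struct.unpack("<II", encode(sec, 0, 0))[0]
new_sec = struct.unpack("<QI", encode(sec, 0, FEATURE_WIDE_TIME))[0]
print(old_sec, new_sec)  # the old path truncates, the new one round-trips
```

When the caller knows at compile time which peers it can talk to, it calls encode_old/encode_new directly and skips the branch, which is exactly the point of step (2).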
rgw: sticky user quota data on bucket removal
Hi, We're testing user quotas on Hammer with civetweb and we're running into an issue with user stats. If the user/admin removes a bucket using the --force/--purge-objects options with s3cmd/radosgw-admin respectively, the user stats will continue to reflect the deleted objects for quota purposes, and there seems to be no way to reset them. It appears that user stats need to be synced prior to bucket removal. Setting "rgw user quota bucket sync interval = 0" appears to solve the problem. What is the downside to setting the interval to 0? I think the right solution is to have an implied sync-stats during bucket removal. Other suggestions? All the best, Paul
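For reference, the workaround under discussion is a one-line gateway setting. A sketch of where it might go in ceph.conf follows; the section name is an assumption and should match your actual radosgw client name:

```ini
; Hypothetical ceph.conf fragment; adjust the section to your rgw instance.
[client.radosgw.gateway]
    ; sync user/bucket quota stats on every modification instead of once
    ; per interval (the trade-off discussed later in this thread)
    rgw user quota bucket sync interval = 0
```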
Re: rgw: sticky user quota data on bucket removal
On Wed, Dec 23, 2015 at 3:53 PM, Paul Von-Stamwitz wrote: > Hi, > > We're testing user quotas on Hammer with civetweb and we're running into an > issue with user stats. > > If the user/admin removes a bucket using -force/-purge-objects options with > s3cmd/radosgw-admin respectively, the user stats will continue to reflect the > deleted objects for quota purposes, and there seems to be no way to reset > them. It appears that user stats need to be sync'ed prior to bucket removal. > Setting " rgw user quota bucket sync interval = 0" appears to solve the > problem. > > What is the downside to setting the interval to 0? We'll update the buckets that are getting modified continuously, instead of once every interval. > > I think the right solution is to have an implied sync-stats during bucket > removal. Other suggestions? > No, syncing the bucket stats on removal sounds right. Yehuda > All the best, > Paul > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: rgw: sticky user quota data on bucket removal
> -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Yehuda Sadeh-Weinraub > Sent: Wednesday, December 23, 2015 5:02 PM > To: Paul Von-Stamwitz > Cc: ceph-devel@vger.kernel.org > Subject: Re: rgw: sticky user quota data on bucket removal > > On Wed, Dec 23, 2015 at 3:53 PM, Paul Von-Stamwitz >wrote: > > Hi, > > > > We're testing user quotas on Hammer with civetweb and we're running > > into an issue with user stats. > > > > If the user/admin removes a bucket using -force/-purge-objects options > > with s3cmd/radosgw-admin respectively, the user stats will continue to > > reflect the deleted objects for quota purposes, and there seems to be no > > way to reset them. It appears that user stats need to be sync'ed prior to > > bucket removal. Setting " rgw user quota bucket sync interval = 0" appears > > to > > solve the problem. > > > > What is the downside to setting the interval to 0? > > We'll update the buckets that are getting modified continuously, instead of > once every interval. > So, I presume that this will impact performance on puts and deletes. We'll take a look at the impact on this. > > > > I think the right solution is to have an implied sync-stats during bucket > > removal. Other suggestions? > > > > No, syncing the bucket stats on removal sounds right. > Great. This would alleviate any performance impact on continuous updates. Thanks! > Yehuda > > > All the best, > > Paul > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > > in the body of a message to majord...@vger.kernel.org More > majordomo > > info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the > body of a message to majord...@vger.kernel.org More majordomo info at > http://vger.kernel.org/majordomo-info.html
Re: New "make check" job for Ceph pull requests
Hi, For the record the pending issues that prevent the "make check" job (https://jenkins.ceph.com/job/ceph-pull-requests/) from running can be found at http://tracker.ceph.com/issues/14172 Cheers On 23/12/2015 21:05, Alfredo Deza wrote: > Hi all, > > As of yesterday (Tuesday Dec 22nd) we have the "make check" job > running within our CI infrastructure, working very similarly as the > previous check with a few differences: > > * there are no longer comments added to the pull requests > * notifications of success (or failure) are done inline in the same > notification box for "This branch has no conflicts with the base > branch" > * All members of the Ceph organization can trigger a job with the > following comment: > test this please > > Changes to the job should be done following our new process: anyone can open > a pull request against the "ceph-pull-requests" job that configures/modifies > it. This process is fairly minimal: > > 1) *Jobs no longer require to make changes in the Jenkins UI*, they > are rather plain text YAML files that live in the ceph/ceph-build.git > repository and have a specific structure. Job changes (including > scripts) are made directly on that repository via pull requests. > > 2) As soon as a PR is merged the changes are automatically pushed to > Jenkins. Regardless if this is a new or old job. All one needs for a > new job to appear is a directory with a working YAML file (see links > at the end on what this means) > > Below, please find a list to resources on how to make changes to a > Jenkins Job, and examples on how mostly anyone can provide changes: > > * Format and configuration of YAML files are consumed by JJB (Jenkins > Job builder), full docs are here: > http://docs.openstack.org/infra/jenkins-job-builder/definition.html > * Where does the make-check configuration lives? 
> https://github.com/ceph/ceph-build/tree/master/ceph-pull-requests > * Full documentation on Job structure and configuration: > https://github.com/ceph/ceph-build#ceph-build > * Everyone has READ permissions on jenkins.ceph.com (you can 'login' > with your github account), current admin members (WRITE permissions) > are: ktdreyer, alfredodeza, gregmeno, dmick, zmc, andrewschoen, > ceph-jenkins, dachary, ldachary > > If you have any questions, we can help and provide guidance and feedback. We > highly encourage contributors to take ownership on this new tool and make it > awesome! > > Thanks, > > > Alfredo > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Loïc Dachary, Artisan Logiciel Libre
New "make check" job for Ceph pull requests
Hi all, As of yesterday (Tuesday Dec 22nd) we have the "make check" job running within our CI infrastructure, working very similarly to the previous check with a few differences: * there are no longer comments added to the pull requests * notifications of success (or failure) are done inline in the same notification box for "This branch has no conflicts with the base branch" * All members of the Ceph organization can trigger a job with the following comment: test this please Changes to the job should be done following our new process: anyone can open a pull request against the "ceph-pull-requests" job that configures/modifies it. This process is fairly minimal: 1) *Jobs no longer require changes in the Jenkins UI*, they are rather plain text YAML files that live in the ceph/ceph-build.git repository and have a specific structure. Job changes (including scripts) are made directly on that repository via pull requests. 2) As soon as a PR is merged the changes are automatically pushed to Jenkins. Regardless if this is a new or old job. All one needs for a new job to appear is a directory with a working YAML file (see links at the end on what this means) Below, please find a list of resources on how to make changes to a Jenkins Job, and examples on how mostly anyone can provide changes: * Format and configuration of YAML files are consumed by JJB (Jenkins Job builder), full docs are here: http://docs.openstack.org/infra/jenkins-job-builder/definition.html * Where does the make-check configuration live? 
https://github.com/ceph/ceph-build/tree/master/ceph-pull-requests * Full documentation on Job structure and configuration: https://github.com/ceph/ceph-build#ceph-build * Everyone has READ permissions on jenkins.ceph.com (you can 'login' with your github account), current admin members (WRITE permissions) are: ktdreyer, alfredodeza, gregmeno, dmick, zmc, andrewschoen, ceph-jenkins, dachary, ldachary If you have any questions, we can help and provide guidance and feedback. We highly encourage contributors to take ownership on this new tool and make it awesome! Thanks, Alfredo -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
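To give a feel for the YAML format the email describes, here is a hypothetical minimal Jenkins Job Builder definition; it is not the actual ceph-pull-requests job, which lives in the ceph/ceph-build repository with more fields:

```yaml
# Hypothetical minimal JJB job definition; the real one is in ceph/ceph-build.
- job:
    name: ceph-pull-requests
    builders:
      - shell: |
          ./run-make-check.sh
```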
jenkins on ceph pull requests: clarify which Operating System is used
Hi Alfredo, I see a make check slave currently runs on jessie, and I seem to remember it ran on trusty slaves before. It's a good thing that operating systems are mixed, but there does not seem to be a clear indication of which operating system is used. For instance, regarding https://jenkins.ceph.com/job/ceph-pull-requests/44/ one has to click on the console and know that it shows in the first few lines as: Building remotely on centos6+158.69.78.199 (x86_64 huge centos6 amd64) in workspace Side note: as CentOS 6 is no longer a supported platform, trying to build on it will fail. Another problem is that choosing an operating system randomly may lead to different test results, and the author of the pull request cannot reproduce the failure because the operating system on which it happens cannot be selected. Unless there is a known strategy with jenkins to deal with that kind of problem, it is probably best to stick to a single operating system, and CentOS 7 would be my choice. Cheers -- Loïc Dachary, Artisan Logiciel Libre
fixing jenkins builds on pull requests
Hi Alfredo, I forgot to mention that the ./run-make-check.sh run currently has no known false negative on CentOS 7. By that I mean that if run on master 100 times, it will succeed 100 times. This is good to debug the jenkins builds on pull requests as we know all problems either come from the infrastructure or the pull request. We do not have to worry about random errors due to race conditions in the tests or things like that. I'll keep an eye on the test results and analyse each failure. For now it would be best to disable reporting failures as they are almost entirely false negatives and will confuse the contributor. The failures come from: * running on unsupported operating systems (CentOS 6 and maybe others) * leftovers from a previous test (which should be removed when a new slave is provisioned for each test) I'll add to this thread when / if I find more. Cheers -- Loïc Dachary, Artisan Logiciel Libre
Let's Not Destroy the World in 2038
Comrades, Ceph's victory is assured. It will be the storage system of The Future. Matt Benjamin has reminded me that if we don't act fast¹ Ceph will be responsible for destroying the world. utime_t() uses a 32-bit second count internally. This isn't great, but it's something we can fix. ceph::real_time currently uses a 64-bit count of nanoseconds, which is better. And we can change it to something else without having to rewrite much other code. The problem lies in our encode/decode functions for time (both utime_t and ceph::real_time, since I didn't want to break compatibility): we use a 32-bit second count. I would like to change the wire and disk representation to a 64-bit second count and a 32-bit nanosecond count. Would there be resistance to a project to do this? I don't know if a FEATURE bit would help. A FEATURE bit to toggle the width of the second count would be ideal if it would work. Otherwise it looks like the best way to do this would be to find all the structures currently ::encoded that hold time values, bump the version number and have an 'old_utime' that we use for everything pre-change. Thank you! ¹ Within the next twenty-three years. But that's not really a long time in the larger scheme of things. -- Senior Software Engineer Red Hat Storage, Ann Arbor, MI, US IRC: Aemerson@{RedHat, OFTC, Freenode} 0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C 7C12 80F7 544B 90ED BFB9
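The problem in miniature (the pack formats below are illustrative, not Ceph's actual encoding): a second count that exceeds 32 bits wraps when encoded narrowly, while a 64-bit field round-trips it intact.

```python
# A 32-bit second field wraps for far-future timestamps; 64 bits does not.
import struct

sec = 5 * 10**9  # a second count well beyond what 32 bits can hold

# Narrow (legacy-style) encoding: value is truncated modulo 2**32.
narrow = struct.unpack("<I", struct.pack("<I", sec & 0xFFFFFFFF))[0]

# Wide (proposed-style) encoding: 64-bit seconds survive the round trip.
wide = struct.unpack("<Q", struct.pack("<Q", sec))[0]

print(narrow, wide)
```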
Re: RBD performance with many childs and snapshots
On 12/22/2015 01:55 PM, Wido den Hollander wrote: On 12/21/2015 11:51 PM, Josh Durgin wrote: On 12/21/2015 11:06 AM, Wido den Hollander wrote: Hi, While implementing the buildvolfrom method in libvirt for RBD I'm stuck at some point. $ virsh vol-clone --pool myrbdpool image1 image2 This would clone image1 to a new RBD image called 'image2'. The code I've written now does: 1. Create a snapshot called image1@libvirt- 2. Protect the snapshot 3. Clone the snapshot to 'image1' wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool rbdpool image1 image2 Vol image2 cloned from image1 wido@wido-desktop:~/repos/libvirt$ root@alpha:~# rbd -p libvirt info image2 rbd image 'image2': size 10240 MB in 2560 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.1976451ead36b format: 2 features: layering, striping flags: parent: libvirt/image1@libvirt-1450724650 overlap: 10240 MB stripe unit: 4096 kB stripe count: 1 root@alpha:~# But this could potentially lead to a lot of snapshots with children on 'image1'. image1 itself will probably never change, but I'm wondering about the negative performance impact this might have on a OSD. Creating them isn't so bad, more snapshots that don't change don't have much affect on the osds. Deleting them is what's expensive, since the osds need to scan the objects to see which ones are part of the snapshot and can be deleted. If you have too many snapshots created and deleted, it can affect cluster load, so I'd rather avoid always creating a snapshot. I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot' into libvirt. There is however no way to pass something like a snapshot name in libvirt when cloning. Any bright suggestions? Or is it fine to create so many snapshots? You could have canonical names for the libvirt snapshots like you suggest, 'libvirt-', and check via rbd_diff_iterate2() whether the parent image changed since the last snapshot. 
That's a bit slower than plain cloning, but with object map + fast diff it's fast again, since it doesn't need to scan all the objects anymore. I think libvirt would need to expand its api a bit to be able to really use it effectively to manage rbd. Hiding the snapshots becomes cumbersome if the application wants to use them too. If libvirt's current model of clones lets parents be deleted before children, that may be a hassle to hide too... I gave it a shot. callback functions are a bit new to me, but I gave it a try: https://github.com/wido/libvirt/commit/756dca8023027616f53c39fa73c52a6d8f86a223 Could you take a look? Left some comments on the commits. Looks good in general. Josh -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RBD performance with many childs and snapshots
On 12/22/2015 05:34 AM, Wido den Hollander wrote: On 21-12-15 23:51, Josh Durgin wrote: On 12/21/2015 11:06 AM, Wido den Hollander wrote: Hi, While implementing the buildvolfrom method in libvirt for RBD I'm stuck at some point. $ virsh vol-clone --pool myrbdpool image1 image2 This would clone image1 to a new RBD image called 'image2'. The code I've written now does: 1. Create a snapshot called image1@libvirt- 2. Protect the snapshot 3. Clone the snapshot to 'image1' wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool rbdpool image1 image2 Vol image2 cloned from image1 wido@wido-desktop:~/repos/libvirt$ root@alpha:~# rbd -p libvirt info image2 rbd image 'image2': size 10240 MB in 2560 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.1976451ead36b format: 2 features: layering, striping flags: parent: libvirt/image1@libvirt-1450724650 overlap: 10240 MB stripe unit: 4096 kB stripe count: 1 root@alpha:~# But this could potentially lead to a lot of snapshots with children on 'image1'. image1 itself will probably never change, but I'm wondering about the negative performance impact this might have on a OSD. Creating them isn't so bad, more snapshots that don't change don't have much affect on the osds. Deleting them is what's expensive, since the osds need to scan the objects to see which ones are part of the snapshot and can be deleted. If you have too many snapshots created and deleted, it can affect cluster load, so I'd rather avoid always creating a snapshot. I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot' into libvirt. There is however no way to pass something like a snapshot name in libvirt when cloning. Any bright suggestions? Or is it fine to create so many snapshots? You could have canonical names for the libvirt snapshots like you suggest, 'libvirt-', and check via rbd_diff_iterate2() whether the parent image changed since the last snapshot. 
That's a bit slower than plain cloning, but with object map + fast diff it's fast again, since it doesn't need to scan all the objects anymore. I'll give that a try, seems like a good suggestion! I'll have to use rbd_diff_iterate() though, since iterate2() is post-hammer and that will not be available on all systems. I think libvirt would need to expand its api a bit to be able to really use it effectively to manage rbd. Hiding the snapshots becomes cumbersome if the application wants to use them too. If libvirt's current model of clones lets parents be deleted before children, that may be a hassle to hide too... Yes, I would love to see: - vol-snap-list - vol-snap-create - vol-snap-delete - vol-snap-revert And then: - vol-clone --snapshot --pool image1 image2 But this would need some more work inside libvirt. Would be very nice though. Yeah, those would be nice. At CloudStack we want to do as much as possible using libvirt, the more features it has there, the less we have to do in Java code :) Dan Berrange has talked about using libvirt storage pools for managing rbd and other storage from openstack nova too, for the same reason. I'm not sure if there are any current plans for that, but you may want to ask him about it on the libvirt list. Josh
Re: Time to move the make check bot to jenkins.ceph.com
Hi, The make check bot moved to jenkins.ceph.com today and ran its first successful job. You will no longer see comments from the bot: it will update the github status instead, which is less intrusive. Cheers On 21/12/2015 11:13, Loic Dachary wrote: > Hi, > > The make check bot is broken in a way that I can't figure out right now. > Maybe now is the time to move it to jenkins.ceph.com ? It should not be more > difficult than launching the run-make-check.sh script. It does not need > network or root access. > > Cheers > -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature
Re: RFC: tool for applying 'ceph daemon ' command to all OSDs
On 12/21/2015 11:29 PM, Gregory Farnum wrote: > On Mon, Dec 21, 2015 at 9:59 PM, Dan Mick wrote: >> I needed something to fetch current config values from all OSDs (sorta >> the opposite of 'injectargs --key value'), so I hacked it, and then >> spiffed it up a bit. Does this seem like something that would be useful >> in this form in the upstream Ceph, or does anyone have any thoughts on >> its design or structure? >> >> It requires a locally-installed ceph CLI and a ceph.conf that points to >> the cluster and any required keyrings. You can also provide it with >> a YAML file mapping host to osds if you want to save time collecting >> that info for a statically-defined cluster, or if you want just a subset >> of OSDs. >> >> https://github.com/dmick/tools/blob/master/osd_daemon_cmd.py >> >> Excerpt from usage: >> >> Execute a Ceph osd daemon command on every OSD in a cluster with >> one connection to each OSD host. >> >> Usage: >> osd_daemon_cmd [-c CONF] [-u USER] [-f FILE] (COMMAND | -k KEY) >> >> Options: >> -c CONF ceph.conf file to use [default: ./ceph.conf] >> -u USER user to connect with ssh >> -f FILE get names and osds from yaml >> COMMAND command other than "config get" to execute >> -k KEY config key to retrieve with config get > > I naively like the functionality being available, but if I'm skimming > this correctly it looks like you're relying on the local node being > able to passwordless-ssh to all of the nodes, and for that account to > be able to access the ceph admin sockets. Granted we rely on the ssh > for ceph-deploy as well, so maybe that's okay, but I'm not sure in > this case since it implies a lot more network openness. Yep; it's basically the same model and role assumed as "cluster destroyer". > Relatedly (perhaps in an opposing direction), maybe we want anything > exposed over the network to have some sort of explicit permissions > model? 
Well, I've heard that idea floated about the admin socket for years, but I don't think anyone's hot to add cephx to it :) > Maybe not and we should just ship the script for trusted users. I > would have liked it on the long-running cluster I'm sure you built it > for. ;) It's like you're clairvoyant.
Re: Is rbd_discard enough to wipe an RBD image?
On 12/21/2015 11:20 PM, Josh Durgin wrote: > On 12/21/2015 11:00 AM, Wido den Hollander wrote: >> My discard code now works, but I wanted to verify. If I understand Jason >> correctly it would be a matter of figuring out the 'order' of an image >> and call rbd_discard in a loop until you reach the end of the image. > > You'd need to get the order via rbd_stat(), convert it to object size > (i.e. (1 << order)), and fetch stripe_count with rbd_get_stripe_count(). > > Then do the discards in (object size * stripe_count) chunks. This > ensures you discard entire objects. This is the size you'd want to use > for import/export as well, ideally. > Thanks! I just implemented this, could you take a look? https://github.com/wido/libvirt/commit/b07925ad50fdb6683b5b21deefceb0829a7842dc >> I just want libvirt to be as feature complete as possible when it comes >> to RBD. > > I see, makes sense. > > Josh -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
Re: RBD performance with many childs and snapshots
On 12/21/2015 11:51 PM, Josh Durgin wrote: > On 12/21/2015 11:06 AM, Wido den Hollander wrote: >> Hi, >> >> While implementing the buildvolfrom method in libvirt for RBD I'm stuck >> at some point. >> >> $ virsh vol-clone --pool myrbdpool image1 image2 >> >> This would clone image1 to a new RBD image called 'image2'. >> >> The code I've written now does: >> >> 1. Create a snapshot called image1@libvirt- >> 2. Protect the snapshot >> 3. Clone the snapshot to 'image1' >> >> wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool >> rbdpool image1 image2 >> Vol image2 cloned from image1 >> >> wido@wido-desktop:~/repos/libvirt$ >> >> root@alpha:~# rbd -p libvirt info image2 >> rbd image 'image2': >> size 10240 MB in 2560 objects >> order 22 (4096 kB objects) >> block_name_prefix: rbd_data.1976451ead36b >> format: 2 >> features: layering, striping >> flags: >> parent: libvirt/image1@libvirt-1450724650 >> overlap: 10240 MB >> stripe unit: 4096 kB >> stripe count: 1 >> root@alpha:~# >> >> But this could potentially lead to a lot of snapshots with children on >> 'image1'. >> >> image1 itself will probably never change, but I'm wondering about the >> negative performance impact this might have on a OSD. > > Creating them isn't so bad, more snapshots that don't change don't have > much affect on the osds. Deleting them is what's expensive, since the > osds need to scan the objects to see which ones are part of the > snapshot and can be deleted. If you have too many snapshots created and > deleted, it can affect cluster load, so I'd rather avoid always > creating a snapshot. > >> I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot' >> into libvirt. There is however no way to pass something like a snapshot >> name in libvirt when cloning. >> >> Any bright suggestions? Or is it fine to create so many snapshots? 
> > You could have canonical names for the libvirt snapshots like you > suggest, 'libvirt-', and check via rbd_diff_iterate2() > whether the parent image changed since the last snapshot. That's a bit > slower than plain cloning, but with object map + fast diff it's fast > again, since it doesn't need to scan all the objects anymore. > > I think libvirt would need to expand its api a bit to be able to really > use it effectively to manage rbd. Hiding the snapshots becomes > cumbersome if the application wants to use them too. If libvirt's > current model of clones lets parents be deleted before children, > that may be a hassle to hide too... > I gave it a shot. Callback functions are a bit new to me, but I gave it a try: https://github.com/wido/libvirt/commit/756dca8023027616f53c39fa73c52a6d8f86a223 Could you take a look? > Josh -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
Re: Let's Not Destroy the World in 2038
On Tue, Dec 22, 2015 at 12:10 PM, Adam C. Emerson wrote: > Comrades, > > Ceph's victory is assured. It will be the storage system of The Future. > Matt Benjamin has reminded me that if we don't act fast¹ Ceph will be > responsible for destroying the world. > > utime_t() uses a 32-bit second count internally. This isn't great, but it's > something we can fix. ceph::real_time currently uses a 64-bit count of > nanoseconds, which is better. And we can change it to something else without > having to rewrite much other code. > > The problem lies in our encode/decode functions for time (both utime_t > and ceph::real_time, since I didn't want to break compatibility): we > use a 32-bit second count. I would like to change the wire and disk > representation to a 64-bit second count and a 32-bit nanosecond count. > > Would there be resistance to a project to do this? I don't know if a > FEATURE bit would help. A FEATURE bit to toggle the width of the second > count would be ideal if it would work. Otherwise it looks like the best > way to do this would be to find all the structures currently ::encoded > that hold time values, bump the version number and have an 'old_utime' > that we use for everything pre-change. Unfortunately, we include utimes in structures that are written to disk. So I think we're stuck with creating a new utime_t and incrementing the struct_v on everything that contains them. :/ Of course, we'll also then need the full feature bit system to make sure we send the old encoding to clients which don't understand the new one, and to prevent a mid-upgrade cluster from writing data on a new node that gets moved to an old node which doesn't understand it. Given that utime_t occurs in a lot of places, and really can't change *again* after this, we probably shouldn't set up the new version with versioned encoding? -Greg > > Thank you! > > ¹ Within the next twenty-three years. But that's not really a long time in the > larger scheme of things. 
> > -- > Senior Software Engineer Red Hat Storage, Ann Arbor, MI, US > IRC: Aemerson@{RedHat, OFTC, Freenode} > 0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C 7C12 80F7 544B 90ED BFB9
RE: tool for applying 'ceph daemon ' command to all OSDs
> -Original Message- > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- > ow...@vger.kernel.org] On Behalf Of Dan Mick > Sent: Tuesday, December 22, 2015 7:00 AM > To: ceph-devel > Subject: RFC: tool for applying 'ceph daemon ' command to all OSDs > > I needed something to fetch current config values from all OSDs (sorta the > opposite of 'injectargs --key value), so I hacked it, and then spiffed it up > a bit. > Does this seem like something that would be useful in this form in the > upstream Ceph, or does anyone have any thoughts on its design or > structure? > You could do it using socat too: Node1 has osd.0 Node1: cd /var/run/ceph sudo socat TCP-LISTEN:60100,fork unix-connect:ceph-osd.0.asok Node2: cd /var/run/ceph sudo socat unix-listen:ceph-osd.0.asok,fork TCP:Node1:60100 Node2: sudo ceph daemon osd.0 help | head { "config diff": "dump diff of current config and default config", "config get": "config get : get the config value", This is more for development/test setup. Regards, Igor. > It requires a locally-installed ceph CLI and a ceph.conf that points to the > cluster and any required keyrings. You can also provide it with a YAML file > mapping host to osds if you want to save time collecting that info for a > statically-defined cluster, or if you want just a subset of OSDs. > > https://github.com/dmick/tools/blob/master/osd_daemon_cmd.py > > Excerpt from usage: > > Execute a Ceph osd daemon command on every OSD in a cluster with one > connection to each OSD host. > > Usage: > osd_daemon_cmd [-c CONF] [-u USER] [-f FILE] (COMMAND | -k KEY) > > Options: >-c CONF ceph.conf file to use [default: ./ceph.conf] >-u USER user to connect with ssh >-f FILE get names and osds from yaml >COMMAND command other than "config get" to execute >-k KEYconfig key to retrieve with config get > > -- > Dan Mick > Red Hat, Inc. 
> Ceph docs: http://ceph.com/docs
Re: RBD performance with many childs and snapshots
On 21-12-15 23:51, Josh Durgin wrote: > On 12/21/2015 11:06 AM, Wido den Hollander wrote: >> Hi, >> >> While implementing the buildvolfrom method in libvirt for RBD I'm stuck >> at some point. >> >> $ virsh vol-clone --pool myrbdpool image1 image2 >> >> This would clone image1 to a new RBD image called 'image2'. >> >> The code I've written now does: >> >> 1. Create a snapshot called image1@libvirt- >> 2. Protect the snapshot >> 3. Clone the snapshot to 'image1' >> >> wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool >> rbdpool image1 image2 >> Vol image2 cloned from image1 >> >> wido@wido-desktop:~/repos/libvirt$ >> >> root@alpha:~# rbd -p libvirt info image2 >> rbd image 'image2': >> size 10240 MB in 2560 objects >> order 22 (4096 kB objects) >> block_name_prefix: rbd_data.1976451ead36b >> format: 2 >> features: layering, striping >> flags: >> parent: libvirt/image1@libvirt-1450724650 >> overlap: 10240 MB >> stripe unit: 4096 kB >> stripe count: 1 >> root@alpha:~# >> >> But this could potentially lead to a lot of snapshots with children on >> 'image1'. >> >> image1 itself will probably never change, but I'm wondering about the >> negative performance impact this might have on a OSD. > > Creating them isn't so bad, more snapshots that don't change don't have > much affect on the osds. Deleting them is what's expensive, since the > osds need to scan the objects to see which ones are part of the > snapshot and can be deleted. If you have too many snapshots created and > deleted, it can affect cluster load, so I'd rather avoid always > creating a snapshot. > >> I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot' >> into libvirt. There is however no way to pass something like a snapshot >> name in libvirt when cloning. >> >> Any bright suggestions? Or is it fine to create so many snapshots? 
> > You could have canonical names for the libvirt snapshots like you > suggest, 'libvirt-', and check via rbd_diff_iterate2() > whether the parent image changed since the last snapshot. That's a bit > slower than plain cloning, but with object map + fast diff it's fast > again, since it doesn't need to scan all the objects anymore. > I'll give that a try, seems like a good suggestion! I'll have to use rbd_diff_iterate() though, since iterate2() is post-hammer and that will not be available on all systems. > I think libvirt would need to expand its api a bit to be able to really > use it effectively to manage rbd. Hiding the snapshots becomes > cumbersome if the application wants to use them too. If libvirt's > current model of clones lets parents be deleted before children, > that may be a hassle to hide too... > Yes, I would love to see: - vol-snap-list - vol-snap-create - vol-snap-delete - vol-snap-revert And then: - vol-clone --snapshot --pool image1 image2 But this would need some more work inside libvirt. Would be very nice though. At CloudStack we want to do as much as possible using libvirt, the more features it has there, the less we have to do in Java code :) Wido > Josh
Re: Issue with Ceph File System and LIO
On Sun, Dec 20, 2015 at 7:38 PM, Eric Eastman wrote: > On Fri, Dec 18, 2015 at 12:18 AM, Yan, Zheng wrote: >> On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman >> wrote: Hi Yan Zheng, Eric Eastman A similar bug was reported in f2fs and btrfs; it does affect 4.4-rc4, and the fixing patch was merged into 4.4-rc5: dfd01f026058 ("sched/wait: Fix the signal handling fix"). Related report & discussion was here: https://lkml.org/lkml/2015/12/12/149 I'm not sure the currently reported ceph issue is related to that, but at least trying an upgraded or patched kernel could verify it. :) Thanks, > >> >> please try rc5 kernel without patches and DEBUG_VM=y >> >> Regards >> Yan, Zheng > > > The latest test with 4.4rc5 with CONFIG_DEBUG_VM=y has run for over 36 > hours with no ERRORS or WARNINGS. My plan is to install the 4.4rc6 > kernel from the Ubuntu kernel-ppa site once it is available, and rerun > the tests. > The test has run for 2 days using the 4.4rc6 kernel from the Ubuntu kernel-ppa site without error or warning. Looks like it was a 4.4rc4 bug.
Re: FreeBSD Building and Testing
On 21-12-2015 01:45, Xinze Chi (信泽) wrote: sorry for the delayed reply. Please have a try: https://github.com/ceph/ceph/commit/ae4a8162eacb606a7f65259c6ac236e144bfef0a. Tried this one first: Testsuite summary for ceph 10.0.1 # TOTAL: 120 # PASS: 100 # SKIP: 0 # XFAIL: 0 # FAIL: 20 # XPASS: 0 # ERROR: 0 So that certainly helps. Have not yet analyzed the log files... But it seems we are getting somewhere. Needed to manually kill a rados access in: | | | \-+- 09792 wjw /bin/sh ../test-driver ./test/ceph_objectstore_tool.py | | | \-+- 09807 wjw python ./test/ceph_objectstore_tool.py (python2.7) | | | \--- 11406 wjw /usr/srcs/Ceph/wip-freebsd-wjw/ceph/src/.libs/rados -p rep_pool -N put REPobject1 /tmp/data.9807/-REPobject1__head But also 2 mon-osds were running, and perhaps one did not belong to that test. So they could be in each other's way. Found some fails in OSDs at: ./test-suite.log:osd/ECBackend.cc: 201: FAILED assert(res.errors.empty()) ./test-suite.log:osd/ECBackend.cc: 201: FAILED assert(res.errors.empty()) struct OnRecoveryReadComplete : public GenContext&> { ECBackend *pg; hobject_t hoid; set want; OnRecoveryReadComplete(ECBackend *pg, const hobject_t &hoid) : pg(pg), hoid(hoid) {} void finish(pair ) { ECBackend::read_result_t &res = in.second; // FIXME??? assert(res.r == 0); 201:assert(res.errors.empty()); assert(res.returned.size() == 1); pg->handle_recovery_read_complete( hoid, res.returned.back(), res.attrs, in.first); } }; Given the FIXME, the code here could be fishy? I would say that just this patch would be sufficient. The second patch also looks useful since it lowers the bar on being tested. And when alignment is only required because of (a)iovec processing, 4096 will likely suffice. Thank you very much for the help. --WjW 2015-12-21 0:10 GMT+08:00 Willem Jan Withagen : Hi, Most of Ceph is getting there, in a crude and rough state. So beneath is a status update on what is not working for me yet. 
Especially help with the alignment problem in os/FileJournal.cc would be appreciated... It would allow me to run ceph-osd and run more tests to completion. What would happen if I comment out this test, and ignore the fact that things might be unaligned? Is it a performance/paging issue? Or is data going to be corrupted? --WjW PASS: src/test/run-cli-tests Testsuite summary for ceph 10.0.0 # TOTAL: 1 # PASS: 1 # SKIP: 0 # XFAIL: 0 # FAIL: 0 # XPASS: 0 # ERROR: 0 gmake test: Testsuite summary for ceph 10.0.0 # TOTAL: 119 # PASS: 95 # SKIP: 0 # XFAIL: 0 # FAIL: 24 # XPASS: 0 # ERROR: 0 The following notes can be made with this: 1) the run-cli-tests run to completion because I excluded the RBD tests 2) gmake test has the following tests FAIL: FAIL: unittest_erasure_code_plugin FAIL: ceph-detect-init/run-tox.sh FAIL: test/erasure-code/test-erasure-code.sh FAIL: test/erasure-code/test-erasure-eio.sh FAIL: test/run-rbd-unit-tests.sh FAIL: test/ceph_objectstore_tool.py FAIL: test/test-ceph-helpers.sh FAIL: test/cephtool-test-osd.sh FAIL: test/cephtool-test-mon.sh FAIL: test/cephtool-test-mds.sh FAIL: test/cephtool-test-rados.sh FAIL: test/mon/osd-crush.sh FAIL: test/osd/osd-scrub-repair.sh FAIL: test/osd/osd-scrub-snaps.sh FAIL: test/osd/osd-config.sh FAIL: test/osd/osd-bench.sh FAIL: test/osd/osd-reactivate.sh FAIL: test/osd/osd-copy-from.sh FAIL: test/libradosstriper/rados-striper.sh FAIL: test/test_objectstore_memstore.sh FAIL: test/ceph-disk.sh FAIL: test/pybind/test_ceph_argparse.py FAIL: test/pybind/test_ceph_daemon.py FAIL: ../qa/workunits/erasure-code/encode-decode-non-regression.sh Most of the fails are because ceph-osd crashed consistently on: -1 journal bl.is_aligned(block_size) 0 bl.is_n_align_sized(CEPH_MINIMUM_BLOCK_SIZE) 1 -1 journal block_size 131072 CEPH_MINIMUM_BLOCK_SIZE 4096 CEPH_PAGE_SIZE 4096 header.alignment 131072 bl buffer::list(len=131072, buffer::ptr(0~131072 0x805319000 in raw 0x805319000 len 131072 nref 1)) os/FileJournal.cc: In function 'void 
FileJournal::align_bl(off64_t, bufferlist &)' thread 805217400 time 2015-12-19 13:43:06.706797
RBD performance with many childs and snapshots
Hi, While implementing the buildvolfrom method in libvirt for RBD I'm stuck at some point. $ virsh vol-clone --pool myrbdpool image1 image2 This would clone image1 to a new RBD image called 'image2'. The code I've written now does: 1. Create a snapshot called image1@libvirt- 2. Protect the snapshot 3. Clone the snapshot to 'image2' wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool rbdpool image1 image2 Vol image2 cloned from image1 wido@wido-desktop:~/repos/libvirt$ root@alpha:~# rbd -p libvirt info image2 rbd image 'image2': size 10240 MB in 2560 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.1976451ead36b format: 2 features: layering, striping flags: parent: libvirt/image1@libvirt-1450724650 overlap: 10240 MB stripe unit: 4096 kB stripe count: 1 root@alpha:~# But this could potentially lead to a lot of snapshots with children on 'image1'. image1 itself will probably never change, but I'm wondering about the negative performance impact this might have on an OSD. I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot' into libvirt. There is however no way to pass something like a snapshot name in libvirt when cloning. Any bright suggestions? Or is it fine to create so many snapshots? -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
Re: Is rbd_discard enough to wipe an RBD image?
On 12/21/2015 04:50 PM, Josh Durgin wrote: > On 12/21/2015 07:09 AM, Jason Dillaman wrote: >> You will have to ensure that your writes are properly aligned with the >> object size (or object set if fancy striping is used on the RBD >> volume). In that case, the discard is translated to remove operations >> on each individual backing object. The only time zeros are written to >> disk is if you specify an offset somewhere in the middle of an object >> (i.e. the whole object cannot be deleted nor can it be truncated) -- >> this is the partial discard case controlled by that configuration param. >> > > I'm curious what's using the virVolWipe stuff - it can't guarantee it's > actually wiping the data in many common configurations, not just with > ceph but with any kind of disk, since libvirt is usually not consuming > raw disks, and with modern flash and smr drives even that is not enough. > There's a recent patch improving the docs on this [1]. > > If the goal is just to make the data inaccessible to the libvirt user, > removing the image is just as good. > > That said, with rbd there's not much cost to zeroing the image with > object map enabled - it's effectively just doing the data removal step > of 'rbd rm' early. > I was looking at the features the RBD storage pool driver is missing in libvirt and it is: - Build from Volume. That's RBD cloning - Uploading and Downloading Volume - Wiping Volume The thing about wiping in libvirt is that the volume still exists afterwards, it is just empty. My discard code now works, but I wanted to verify. If I understand Jason correctly it would be a matter of figuring out the 'order' of a image and call rbd_discard in a loop until you reach the end of the image. I just want libvirt to be as feature complete as possible when it comes to RBD. 
Wido > Josh > > [1] http://comments.gmane.org/gmane.comp.emulators.libvirt/122235 -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
Re: FreeBSD Building and Testing
On 20-12-2015 17:10, Willem Jan Withagen wrote: Hi, Most of Ceph is getting there, in a crude and rough state. So beneath is a status update on what is not working for me yet. Further: A) unittest_erasure_code_plugin fails because a different error code is returned when dlopen-ing a non-existent library. load dlopen(.libs/libec_invalid.so): Cannot open ".libs/libec_invalid.so" load dlsym(.libs/libec_missing_version.so, __erasure_code_init): Undefined symbol "__erasure_code_init" test/erasure-code/TestErasureCodePlugin.cc:88: Failure Value of: instance.factory("missing_version", g_conf->erasure_code_dir, profile, _code, ) Actual: -2 Expected: -18 EXDEV is actually 18, so that part is correct. But EXDEV is a cross-device link error. Whereas the actual answer, -2, is factually correct: #define ENOENT 2 /* No such file or directory */ So why does the test expect EXDEV instead of ENOENT? Could be a typical Linux <> FreeBSD thingy. --WjW
Re: Is rbd_discard enough to wipe an RBD image?
On 12/21/2015 11:00 AM, Wido den Hollander wrote:
> My discard code now works, but I wanted to verify. If I understand
> Jason correctly it would be a matter of figuring out the 'order' of an
> image and calling rbd_discard in a loop until you reach the end of the image.

You'd need to get the order via rbd_stat(), convert it to the object size (i.e. (1 << order)), and fetch stripe_count with rbd_get_stripe_count(). Then do the discards in (object size * stripe_count) chunks. This ensures you discard entire objects. Ideally, this is the size you'd want to use for import/export as well.

> I just want libvirt to be as feature complete as possible when it
> comes to RBD.

I see, makes sense.

Josh
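The loop Josh describes could be sketched like this (illustration only: `discard` is a hypothetical callable standing in for rbd_discard(), and `size`/`order`/`stripe_count` stand in for the values the real rbd_stat() and rbd_get_stripe_count() C calls return):

```python
def wipe_image(size, order, stripe_count, discard):
    """Discard a whole image in (object size * stripe count) chunks.

    size         -- image size in bytes (from rbd_stat)
    order        -- object size exponent (from rbd_stat)
    stripe_count -- from rbd_get_stripe_count
    discard      -- callable(offset, length), standing in for rbd_discard
    """
    object_size = 1 << order            # e.g. order 22 -> 4 MiB objects
    chunk = object_size * stripe_count  # discard whole object sets at a time
    offset = 0
    while offset < size:
        length = min(chunk, size - offset)
        discard(offset, length)
        offset += length
    return offset

# Example matching the image in this thread: 10240 MB, order 22,
# stripe count 1 -> 2560 whole-object discards.
calls = []
wiped = wipe_image(10 * 1024**3, 22, 1, lambda off, ln: calls.append((off, ln)))
print(len(calls), wiped == 10 * 1024**3)  # 2560 True
```

With a stripe count above 1 the chunk grows accordingly, so each discard still covers complete objects of the object set.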
Fwd: FileStore : no wait thread queue_sync
FYI.

-- Forwarded message --
From: David Casier
Date: 2015-12-21 23:19 GMT+01:00
Subject: FileStore : no wait thread queue_sync
To: Ceph Development , Sage Weil
Cc: Benoît LORIOT , Sébastien VALSEMEY

Hi,

What do you think about:

if (!journal && m_filestore_direct) {
  apply_manager.commit_finish();
}

in FileStore::queue_transactions? For direct operation with no waiting on the sync_entry thread?

I would also propose adding a parameter "m_omap_is_safe" to bypass XATTR_SPILL_OUT_NAME and reduce IOPS on hard drives:

if (!m_omap_is_safe) {
  r = chain_fgetxattr(**o, XATTR_SPILL_OUT_NAME, buf, sizeof(buf));
  if (r >= 0 && !strncmp(buf, XATTR_NO_SPILL_OUT, sizeof(XATTR_NO_SPILL_OUT))) {
    r = chain_fsetxattr(**n, XATTR_SPILL_OUT_NAME, XATTR_NO_SPILL_OUT, sizeof(XATTR_NO_SPILL_OUT));
  } else {
    r = chain_fsetxattr(**n, XATTR_SPILL_OUT_NAME, XATTR_SPILL_OUT, sizeof(XATTR_SPILL_OUT));
  }
}

--
Cordialement,
David CASIER
Re: RBD performance with many childs and snapshots
On 12/21/2015 11:06 AM, Wido den Hollander wrote:
> Hi,
>
> While implementing the buildvolfrom method in libvirt for RBD I'm
> stuck at some point.
>
> $ virsh vol-clone --pool myrbdpool image1 image2
>
> This would clone image1 to a new RBD image called 'image2'. The code
> I've written now does:
>
> 1. Create a snapshot called image1@libvirt-
> 2. Protect the snapshot
> 3. Clone the snapshot to 'image2'
>
> wido@wido-desktop:~/repos/libvirt$ ./tools/virsh vol-clone --pool rbdpool image1 image2
> Vol image2 cloned from image1
> wido@wido-desktop:~/repos/libvirt$
>
> root@alpha:~# rbd -p libvirt info image2
> rbd image 'image2':
>     size 10240 MB in 2560 objects
>     order 22 (4096 kB objects)
>     block_name_prefix: rbd_data.1976451ead36b
>     format: 2
>     features: layering, striping
>     flags:
>     parent: libvirt/image1@libvirt-1450724650
>     overlap: 10240 MB
>     stripe unit: 4096 kB
>     stripe count: 1
> root@alpha:~#
>
> But this could potentially lead to a lot of snapshots with children on
> 'image1'. image1 itself will probably never change, but I'm wondering
> about the negative performance impact this might have on an OSD.

Creating them isn't so bad; snapshots that don't change don't have much effect on the OSDs. Deleting them is what's expensive, since the OSDs need to scan the objects to see which ones are part of the snapshot and can be deleted. If too many snapshots are created and deleted, it can affect cluster load, so I'd rather avoid always creating a snapshot.

> I'd rather not hardcode a snapshot name like 'libvirt-parent-snapshot'
> into libvirt. There is however no way to pass something like a
> snapshot name in libvirt when cloning.
>
> Any bright suggestions? Or is it fine to create so many snapshots?

You could have canonical names for the libvirt snapshots like you suggest, 'libvirt-', and check via rbd_diff_iterate2() whether the parent image changed since the last snapshot. That's a bit slower than plain cloning, but with object map + fast diff it's fast again, since it doesn't need to scan all the objects anymore.
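The rbd_diff_iterate2() idea could be sketched roughly like this (pure illustration: `diff_extents` is a hypothetical stand-in for the extents the real C callback would be invoked with, and the snapshot names and timestamp are made up):

```python
def parent_changed(diff_extents):
    """True if any extent was reported since the last snapshot.

    The real rbd_diff_iterate2() callback would simply set a flag the
    first time it fires; no extents reported means no change.
    """
    return len(diff_extents) > 0

def pick_clone_snapshot(last_snap, diff_extents, now=1450724650):
    # Reuse the previous canonical 'libvirt-<timestamp>' snapshot when
    # the parent is unchanged; otherwise a new one is needed.
    if last_snap and not parent_changed(diff_extents):
        return last_snap
    return "libvirt-%d" % now

# Parent untouched since last snapshot -> reuse it, no new snapshot.
print(pick_clone_snapshot("libvirt-1450000000", []))
# Parent was written to -> a fresh snapshot is required.
print(pick_clone_snapshot("libvirt-1450000000", [(0, 4096, True)]))
```

This keeps the snapshot count bounded for parents that never change, which addresses the expensive-deletion concern above.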
I think libvirt would need to expand its API a bit to be able to really use it effectively to manage RBD. Hiding the snapshots becomes cumbersome if the application wants to use them too. If libvirt's current model of clones lets parents be deleted before children, that may be a hassle to hide too...

Josh
Re: Is rbd_discard enough to wipe an RBD image?
>> I just want to know if this is sufficient to wipe an RBD image?

AFAIK, Ceph writes zeroes into the RADOS objects when discard is used. There is an option to skip writing zeroes if needed:

OPTION(rbd_skip_partial_discard, OPT_BOOL, false) // when trying to discard a range inside an object, set to true to skip zeroing the range.

----- Original message -----
From: "Wido den Hollander"
To: "ceph-devel"
Sent: Sunday, December 20, 2015 22:21:50
Subject: Is rbd_discard enough to wipe an RBD image?

Hi,

I'm busy implementing the volume-wiping method of the libvirt storage pool backend, and instead of writing zeroes to the whole RBD image I'm using rbd_discard. Using a 4MB length I start at offset 0 and work my way through the whole RBD image.

A quick try shows me that my partition table + filesystem are gone from the RBD image after I've run rbd_discard. I just want to know if this is sufficient to wipe an RBD image? Or would it be better to fully fill the image with zeroes?

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on
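To make the partial-vs-whole-object distinction from Jason's earlier explanation concrete, here is a small arithmetic sketch (illustration only, not librbd code): a discard range splits into a partial head, some whole objects that can simply be removed, and a partial tail. Only the partial pieces involve zeroing, which is the part rbd_skip_partial_discard controls.

```python
def split_discard(offset, length, object_size):
    """Split a discard range into (partial_head, whole_objects, partial_tail).

    Whole objects are removed outright; only the head/tail partials
    need zeroing (the case rbd_skip_partial_discard can skip).
    """
    end = offset + length
    first_whole = -(-offset // object_size) * object_size  # round up
    last_whole = (end // object_size) * object_size        # round down
    if first_whole > last_whole:
        # The whole range sits inside a single object: all partial.
        return (length, 0, 0)
    head = first_whole - offset
    tail = end - last_whole
    whole = (last_whole - first_whole) // object_size
    return (head, whole, tail)

obj = 4 << 20  # 4 MiB objects, i.e. the default order 22

# Aligned wipe of ten objects: pure removals, nothing is zeroed.
print(split_discard(0, 10 * obj, obj))    # (0, 10, 0)
# Range straddling an object boundary: 2 MiB zeroed on each side.
print(split_discard(obj // 2, obj, obj))  # (2097152, 0, 2097152)
# Tiny range inside one object: all partial, all zeroing.
print(split_discard(1024, 2048, obj))     # (2048, 0, 0)
```

This is why the aligned loop discussed elsewhere in this thread deletes objects rather than writing zeroes.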
Re: Fwd: Client still connect failed leader after that mon down
On Mon, 21 Dec 2015, Zhi Zhang wrote:
> Regards,
> Zhi Zhang (David)
> Contact: zhang.david2...@gmail.com
>          zhangz.da...@outlook.com
>
> -- Forwarded message --
> From: Jaze Lee
> Date: Mon, Dec 21, 2015 at 4:08 PM
> Subject: Re: Client still connect failed leader after that mon down
> To: Zhi Zhang
>
> Hello,
> I am terribly sorry. I think we may not need to reconstruct
> monclient.{h,cc}; we found the parameter mon_client_hunt_interval is
> very useful. When we set mon_client_hunt_interval = 0.5, the time to
> run a ceph command is very small even if it first connects the down
> leader mon.
>
> The first time I asked the question was because we found the parameter
> on the official site
> http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/.
> It is written there as:
>
> mon client hung interval

Yep, that's a typo. Do you mind submitting a patch to fix it? Thanks!
sage

> Description: The client will try a new monitor every N seconds until it
> establishes a connection.
> Type: Double
> Default: 3.0
>
> And we set it; it did not work. I think maybe it is a slip of the pen?
> The right configuration parameter should be mon client hunt interval.
>
> Can someone please help me fix this on the official site?
>
> Thanks a lot.
>
> 2015-12-21 14:00 GMT+08:00 Jaze Lee:
> > Right now we use simple msg, and the ceph version is 0.80...
> >
> > 2015-12-21 10:55 GMT+08:00 Zhi Zhang:
> >> Which msg type and ceph version are you using?
> >>
> >> Once we used 0.94.1 with async msg, we encountered a similar issue.
> >> A client trying to connect a down monitor when it was just started
> >> would hang there, because the previous async msg used blocking
> >> connection mode.
> >>
> >> After we backported non-blocking mode of async msg from a higher
> >> ceph version, we haven't encountered such an issue yet.
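For anyone else hitting the same slow failover, the setting described above would look like this as a ceph.conf fragment (illustrative values from this thread; note the correct spelling is "hunt", not the "hung" from the docs):

```ini
[client]
    # Try the next monitor after 0.5s instead of the default 3.0s
    mon client hunt interval = 0.5
```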
> >> Regards,
> >> Zhi Zhang (David)
> >> Contact: zhang.david2...@gmail.com
> >>          zhangz.da...@outlook.com
> >>
> >> On Fri, Dec 18, 2015 at 11:41 AM, Jevon Qiao wrote:
> >>> On 17/12/15 21:27, Sage Weil wrote:
> >>>> On Thu, 17 Dec 2015, Jaze Lee wrote:
> >>>>> Hello cephers:
> >>>>> In our test, there are three monitors. We find that a client running
> >>>>> a ceph command is slow when the leader mon is down. Even after a long
> >>>>> time, a client running a ceph command will still be slow the first time.
> >>>>> From strace, we find that the client first tries to connect to the
> >>>>> leader, then after 3s, it connects to the second. After some searching
> >>>>> we find that the quorum has not changed; the leader is still the down
> >>>>> monitor. Is that normal? Or is there something I missed?
> >>>>
> >>>> It's normal. Even when the quorum does change, the client doesn't
> >>>> know that. It should be contacting a random mon on startup, though, so I
> >>>> would expect the 3s delay 1/3 of the time.
> >>>
> >>> That's because the client randomly picks a mon from the monmap. But what we
> >>> observed is that when a mon is down no change is made to the monmap (neither
> >>> the epoch nor the members). Is that the culprit for this phenomenon?
> >>>
> >>> Thanks,
> >>> Jevon
> >>>
> >>>> A long-standing low-priority feature request is to have the client contact
> >>>> 2 mons in parallel so that it can still connect quickly if one is down.
> >>>> It requires some non-trivial work in mon/MonClient.{cc,h} though and I
> >>>> don't think anyone has looked at it seriously.
> sage
Re: Issue with Ceph File System and LIO
On Sun, Dec 20, 2015 at 6:38 PM, Eric Eastman wrote:
> On Fri, Dec 18, 2015 at 12:18 AM, Yan, Zheng wrote:
>> On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman wrote:

Hi Yan Zheng, Eric Eastman,

A similar bug was reported in f2fs and btrfs; it does affect 4.4-rc4, and the fixing patch was merged into 4.4-rc5: dfd01f026058 ("sched/wait: Fix the signal handling fix"). The related report & discussion is here: https://lkml.org/lkml/2015/12/12/149

I'm not sure the currently reported ceph issue is related to that, but at least testing with an upgraded or patched kernel could verify it. :)

Thanks,

>> please try rc5 kernel without patches and DEBUG_VM=y
>>
>> Regards,
>> Yan, Zheng

> The latest test with 4.4rc5 with CONFIG_DEBUG_VM=y has run for over 36
> hours with no ERRORS or WARNINGS. My plan is to install the 4.4rc6
> kernel from the Ubuntu kernel-ppa site once it is available, and rerun
> the tests.
>
> Before running this test I had to rebuild the Ceph File System, as
> after the last logged errors on Friday using the 4.4rc4 kernel, the
> Ceph File System hung accessing the exported image file.
> After rebooting my iSCSI gateway using the Ceph File System, from /,
> using the command "strace du -a cephfs" (the mount point), the hang
> happened on the newfstatat call on my image file:
>
> write(1, "0\tcephfs/ctdb/.ctdb.lock\n", 250 cephfs/ctdb/.ctdb.lock
> ) = 25
> close(5) = 0
> write(1, "0\tcephfs/ctdb\n", 140 cephfs/ctdb
> ) = 14
> newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896, ...}, AT_SYMLINK_NOFOLLOW) = 0
> openat(4, "iscsi", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 3
> fcntl(3, F_GETFD) = 0
> fcntl(3, F_SETFD, FD_CLOEXEC) = 0
> fstat(3, {st_mode=S_IFDIR|0755, st_size=993814480896, ...}) = 0
> fcntl(3, F_GETFL) = 0x38800 (flags O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW)
> fcntl(3, F_SETFD, FD_CLOEXEC) = 0
> newfstatat(4, "iscsi", {st_mode=S_IFDIR|0755, st_size=993814480896, ...}, AT_SYMLINK_NOFOLLOW) = 0
> fcntl(3, F_DUPFD, 3) = 5
> fcntl(5, F_GETFD) = 0
> fcntl(5, F_SETFD, FD_CLOEXEC) = 0
> getdents(3, /* 8 entries */, 65536) = 288
> getdents(3, /* 0 entries */, 65536) = 0
> close(3) = 0
> newfstatat(5, "iscsi900g.img", ^C
> ^C^C^C
> ^Z
>
> I could not break out with a ^C, and had to background the process to
> get my prompt back. The process would not die, so I had to hard reset
> the system.
>
> This same hang happened on 2 other kernel-mounted systems using a 4.3.0
> kernel.
>
> On a separate system, I fuse-mounted the file system and a "du -a
> cephfs" hung at the same point. Once again I could not break out of the
> hang, and had to hard reset the system.
>
> Restarting the MDS and Monitors did not clear the issue.
> Taking a quick look at the dumpcache showed it was large:
>
> # ceph mds tell 0 dumpcache /tmp/dump.txt
> ok
> # wc /tmp/dump.txt
> 370556 5002449 59211054 /tmp/dump.txt
> # tail /tmp/dump.txt
> [inode 1259276 [...c4,head] ~mds0/stray0/1259276/ auth v977593 snaprealm=0x561339e3fb00 f(v0 m2015-12-12 00:51:04.345614) n(v0 rc2015-12-12 00:51:04.345614 1=0+1) (iversion lock) 0x561339c66228]
> [inode 120c1ba [...a6,head] ~mds0/stray0/120c1ba/ auth v742016 snaprealm=0x56133ad19600 f(v0 m2015-12-10 18:25:55.880167) n(v0 rc2015-12-10 18:25:55.880167 1=0+1) (iversion lock) 0x56133a5e0d88]
> [inode 10d0088 [...77,head] ~mds0/stray6/10d0088/ auth v292336 snaprealm=0x5613537673c0 f(v0 m2015-12-08 19:23:20.269283) n(v0 rc2015-12-08 19:23:20.269283 1=0+1) (iversion lock) 0x56134c2f7378]

These are deleted files that haven't been trimmed yet...

> I tried one more thing:
>
> ceph daemon mds.0 flush journal
>
> and restarted the MDS. Accessing the file system still locked up, but
> a "du -a cephfs" did not even get to the iscsi900g.img file. As I was
> running on a broken rc kernel, with snapshots turned on

...and I think we have some known issues in the tracker about snap trimming and snapshotted inodes. So this is not entirely surprising. :/
-Greg

> , when this corruption happened, I decided to recreate the file system
> and restarted the ESXi iSCSI test.
>
> Regards,
> Eric
RFC: tool for applying 'ceph daemon ' command to all OSDs
I needed something to fetch current config values from all OSDs (sorta the opposite of 'injectargs --key value'), so I hacked it up, and then spiffed it up a bit. Does this seem like something that would be useful in this form in upstream Ceph, or does anyone have any thoughts on its design or structure?

It requires a locally-installed ceph CLI and a ceph.conf that points to the cluster and any required keyrings. You can also provide it with a YAML file mapping hosts to OSDs if you want to save time collecting that info for a statically-defined cluster, or if you want just a subset of OSDs.

https://github.com/dmick/tools/blob/master/osd_daemon_cmd.py

Excerpt from usage:

Execute a Ceph osd daemon command on every OSD in a cluster with one
connection to each OSD host.

Usage:
  osd_daemon_cmd [-c CONF] [-u USER] [-f FILE] (COMMAND | -k KEY)

Options:
  -c CONF   ceph.conf file to use [default: ./ceph.conf]
  -u USER   user to connect with ssh
  -f FILE   get names and osds from yaml
  COMMAND   command other than "config get" to execute
  -k KEY    config key to retrieve with config get

--
Dan Mick
Red Hat, Inc.
Ceph docs: http://ceph.com/docs
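The core loop of such a tool could be sketched like this (illustration only, not the script linked above: `run` is a hypothetical callable standing in for "ssh to the host and run the admin-socket command", and the fake runner at the bottom exists purely so the sketch is self-contained):

```python
import json

def collect(host_to_osds, key, run):
    """Return {osd_id: value} for a config key across all OSDs.

    host_to_osds -- mapping like {"host1": [0, 1], "host2": [2]}, as the
                    YAML file described above would provide
    run          -- callable(host, command) -> stdout string; a stand-in
                    for executing "ceph daemon osd.N config get KEY"
                    over one ssh connection per host
    """
    results = {}
    for host, osds in sorted(host_to_osds.items()):
        for osd in osds:
            out = run(host, "ceph daemon osd.%d config get %s" % (osd, key))
            # "ceph daemon ... config get" prints JSON like {"<key>": "<value>"}
            results[osd] = json.loads(out)[key]
    return results

# Fake runner for illustration; the real tool would ssh to each host.
fake = lambda host, cmd: json.dumps({"osd_max_backfills": "1"})
print(collect({"host1": [0, 1], "host2": [2]}, "osd_max_backfills", fake))
```

Grouping OSDs by host keeps it to one connection per host, which is the main point of the tool.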