Re: [PATCH] configure.ac: check for org.junit.rules.ExternalResource
Am 10.01.2013 05:32, schrieb Gary Lowell:
> I have this patch, and the ones from Friday in the wip-rpm-update branch.
> Everything looks good except that we have the following new warning from
> configure:
>
> ....
> checking for kaffe... no
> checking for java... java
> checking for uudecode... no
> WARNING: configure: I have to compile Test.class from scratch
> checking for gcj... no
> checking for guavac... no
> checking for jikes... no
> ....
>
> This may have to do with something in our build environment.

I assume you have no uudecode installed. It should be part of sharutils
(http://www.gnu.org/software/sharutils/)

Regards,

Danny
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
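As a quick sanity check on the build host (my own illustration, not part of the patch), you can ask whether uudecode is on the PATH; configure only falls back to compiling Test.class from scratch when it is missing:

```python
import shutil

# uudecode ships with GNU sharutils; if it is absent, configure prints the
# "I have to compile Test.class from scratch" warning seen above.
path = shutil.which("uudecode")
print(path if path else "uudecode missing - install GNU sharutils")
```

On a host with sharutils installed this prints the binary's path (e.g. `/usr/bin/uudecode`); otherwise it prints the install hint.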
Re: recovering from 95% full osd
Hello again!

I left the system in a working state overnight and found it in a weird state this morning:

chef@ceph-node02:/var/log/ceph$ ceph -s
   health HEALTH_OK
   monmap e4: 3 mons at {a=192.168.7.11:6789/0,b=192.168.7.12:6789/0,c=192.168.7.13:6789/0}, election epoch 254, quorum 0,1,2 a,b,c
   osdmap e348: 3 osds: 3 up, 3 in
   pgmap v114606: 384 pgs: 384 active+clean; 161 GB data, 326 GB used, 429 GB / 755 GB avail
   mdsmap e4623: 1/1/1 up {0=b=up:active}, 1 up:standby

So it looks OK at first glance, but I am not able to mount ceph from any of the nodes:

be01:~# mount /var/www/jroger.org/data
mount: 192.168.7.11:/: can't read superblock

On the nodes which had ceph mounted yesterday I am able to look through the filesystem, but any kind of data read causes the client to hang. I made a trace on the active mds with debug ms/mds = 20 (http://wh.of.kz/ceph_logs.tar.gz). Could you please help identify what's going on?

2013/1/9 Roman Hlynovskiy :
>>> How many pgs do you have? ('ceph osd dump | grep ^pool').
>>
>> I believe this is it. 384 PGs, but three pools of which only one (or maybe a
>> second one, sort of) is in use. Automatically setting the right PG counts is
>> coming some day, but until then being able to set up pools of the right size
>> is a big gotcha. :(
>> Depending on how mutable the data is, recreate with larger PG counts on the
>> pools in use. Otherwise we can do something more detailed.
>> -Greg
>
> hm... what would be the recommended PG count per pool?
>
> chef@cephgw:~$ ceph osd lspools
> 0 data,1 metadata,2 rbd,
> chef@cephgw:~$ ceph osd pool get data pg_num
> PG_NUM: 128
> chef@cephgw:~$ ceph osd pool get metadata pg_num
> PG_NUM: 128
> chef@cephgw:~$ ceph osd pool get rbd pg_num
> PG_NUM: 128
>
> according to
> http://ceph.com/docs/master/rados/operations/placement-groups/
>
>              (OSDs * 100)
> Total PGs = --------------
>                Replicas
>
> I have 3 OSDs and 2 replicas for each object, which gives a recommended PG count of 150.
>
> Will it make much difference to set 150 instead of 128, or should I
> base it on different values?
>
> btw, just one more off-topic question:
>
> chef@ceph-node03:~$ ceph pg dump | egrep -v '^(0\.|1\.|2\.)' | column -t
> dumped all in format plain
> version            113906
> last_osdmap_epoch  323
> last_pg_scan       1
> full_ratio         0.95
> nearfull_ratio     0.85
> pg_stat  objects  mip  degr  unf  bytes         log       disklog  state  state_stamp  v  reported  up  acting  last_scrub  scrub_stamp  last_deep_scrub  deep_scrub_stamp
> pool 0   74748    0    0     0    286157692336  17668034  17668034
> pool 1   6180     0    0          131846062     6414518   6414518
> pool 2   0        0    0     0    0             0         0
> sum      75366    0    0     0    286289538398  24082552  24082552
> osdstat  kbused     kbavail    kb         hb in  hb out
> 0        157999220  106227596  264226816  [1,2]  []
> 1        185604948  78621868   264226816  [0,2]  []
> 2        219475396  44751420   264226816  [0,1]  []
> sum      563079564  229600884  792680448
>
> pool 0 (data) is used for data storage
> pool 1 (metadata) is used for metadata storage
>
> what is pool 2 (rbd) for? It looks like it's absolutely empty.
>
>>> You might also adjust the crush tunables, see
>>>
>>> http://ceph.com/docs/master/rados/operations/crush-map/?highlight=tunable#tunables
>>>
>>> sage
>
> Thanks for the link, Sage. I set the tunable values according to the doc.
> Btw, the online document is missing the magical param for crushmap which
> allows those scary_tunables )
>
> --
> ...WBR, Roman Hlynovskiy

--
...WBR, Roman Hlynovskiy
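The rule of thumb quoted above can be sketched in a few lines of Python (my own illustration, not from the thread; the final power-of-two rounding is a commonly recommended extra step, not part of the formula as quoted):

```python
def recommended_pg_count(num_osds, replicas):
    """Rule of thumb from the Ceph placement-group docs:
    (OSDs * 100) / replicas, plus a round-up to the next power of two."""
    raw = num_osds * 100 // replicas
    pg = 1
    while pg < raw:
        pg *= 2
    return raw, pg

# 3 OSDs, 2 replicas -> raw value 150, rounded up to 256
print(recommended_pg_count(3, 2))   # (150, 256)
```

For the cluster in this thread (3 OSDs, 2 replicas) the raw value is 150, so the existing 128 PGs per pool are only slightly under the guideline.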
Re: [PATCH] configure.ac: check for org.junit.rules.ExternalResource
I have this patch, and the ones from Friday in the wip-rpm-update branch.
Everything looks good except that we have the following new warning from
configure:

....
checking for kaffe... no
checking for java... java
checking for uudecode... no
WARNING: configure: I have to compile Test.class from scratch
checking for gcj... no
checking for guavac... no
checking for jikes... no
....

This may have to do with something in our build environment.

Cheers,
Gary

On Jan 9, 2013, at 1:54 PM, Noah Watkins wrote:

> I haven't tested this yet, but I like it. I think several of these
> macros can be used to simplify a bit more of the Java config bit. I
> also just saw the ax_jni_include_dir macro in the autoconf archive and
> it looks like that can help clean up too.
>
> On Wed, Jan 9, 2013 at 1:35 PM, Danny Al-Gaaf wrote:
>> The attached patch depends on the set of 6 patches I sent some days ago.
>> See: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/11793
>>
>> Danny Al-Gaaf (1):
>>   configure.ac: check for org.junit.rules.ExternalResource
>>
>>  autogen.sh                |   2 +-
>>  configure.ac              |  29 ++---
>>  m4/ac_check_class.m4      | 108 ++
>>  m4/ac_check_classpath.m4  |  24 +++
>>  m4/ac_check_rqrd_class.m4 |  26 +++
>>  m4/ac_java_options.m4     |  33 ++
>>  m4/ac_prog_jar.m4         |  39 +
>>  m4/ac_prog_java.m4        |  83 +++
>>  m4/ac_prog_java_works.m4  |  98 +
>>  m4/ac_prog_javac.m4       |  45 +++
>>  m4/ac_prog_javac_works.m4 |  36 
>>  m4/ac_prog_javah.m4       |  28 
>>  m4/ac_try_compile_java.m4 |  40 +
>>  m4/ac_try_run_javac.m4    |  41 ++
>>  14 files changed, 615 insertions(+), 17 deletions(-)
>>  create mode 100644 m4/ac_check_class.m4
>>  create mode 100644 m4/ac_check_classpath.m4
>>  create mode 100644 m4/ac_check_rqrd_class.m4
>>  create mode 100644 m4/ac_java_options.m4
>>  create mode 100644 m4/ac_prog_jar.m4
>>  create mode 100644 m4/ac_prog_java.m4
>>  create mode 100644 m4/ac_prog_java_works.m4
>>  create mode 100644 m4/ac_prog_javac.m4
>>  create mode 100644 m4/ac_prog_javac_works.m4
>>  create mode 100644 m4/ac_prog_javah.m4
>>  create mode 100644 m4/ac_try_compile_java.m4
>>  create mode 100644 m4/ac_try_run_javac.m4
>>
>> --
>> 1.8.1
Re: OSD crash, ceph version 0.56.1
On Wed, Jan 9, 2013 at 4:38 PM, Sage Weil wrote:
> On Wed, 9 Jan 2013, Ian Pye wrote:
>> Hi,
>>
>> Every time I try to bring up an OSD, it crashes and I get the
>> following: "error (121) Remote I/O error not handled on operation 20"
>
> This error code (EREMOTEIO) is not used by Ceph. What fs are you using?
> Which kernel version? Anything else unusual happen with your hardware
> recently that might have wreaked havoc on your underlying fs?

3.7.1 kernel with XFS. It's a demo box from a vendor, so it should be
brand new. I'm going to say it's a disk error, given the following:

mkfs.xfs: read failed: Input/output error

Interestingly, running an OSD with btrfs worked fine on the same disk.

Thanks for the help,

Ian

> sage
>
>> The cluster is new and only has a little bit of data on it. Any ideas
>> what is going on? Does Remote I/O mean a network error? Full log
>> below:
>>
>>     -9> 2013-01-10 00:00:20.182237 7f2ddde8f910  0
>> filestore(/mnt/dist_j/ceph) error (121) Remote I/O error not handled
>> on operation 20 (12.0.0, or op 0, counting from 0)
>>     -8> 2013-01-10 00:00:20.182275 7f2ddde8f910  0
>> filestore(/mnt/dist_j/ceph) unexpected error code
>>     -7> 2013-01-10 00:00:20.182285 7f2ddde8f910  0
>> filestore(/mnt/dist_j/ceph) transaction dump:
>> { "ops": [
>>       { "op_num": 0,
>>         "op_name": "mkcoll",
>>         "collection": "0.2c0_head"},
>>       { "op_num": 1,
>>         "op_name": "collection_setattr",
>>         "collection": "0.2c0_head",
>>         "name": "info",
>>         "length": 5},
>>       { "op_num": 2,
>>         "op_name": "truncate",
>>         "collection": "meta",
>>         "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
>>         "offset": 0},
>>       { "op_num": 3,
>>         "op_name": "write",
>>         "collection": "meta",
>>         "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
>>         "length": 531,
>>         "offset": 0,
>>         "bufferlist length": 531},
>>       { "op_num": 4,
>>         "op_name": "remove",
>>         "collection": "meta",
>>         "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1"},
>>       { "op_num": 5,
>>         "op_name": "write",
>>         "collection": "meta",
>>         "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1",
>>         "length": 0,
>>         "offset": 0,
>>         "bufferlist length": 0},
>>       { "op_num": 6,
>>         "op_name": "collection_setattr",
>>         "collection": "0.2c0_head",
>>         "name": "ondisklog",
>>         "length": 34},
>>       { "op_num": 7,
>>         "op_name": "nop"}]}
>>     -6> 2013-01-10 00:00:20.183085 7f2dd5e7f910 10 monclient:
>> _send_mon_message to mon.a at 108.162.209.120:6789/0
>>     -5> 2013-01-10 00:00:20.183108 7f2dd5e7f910  1 --
>> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
>> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
>> v22) v1 -- ?+0 0x5b15600 con 0x34629a0
>>     -4> 2013-01-10 00:00:20.183772 7f2dd6680910 10 monclient:
>> _send_mon_message to mon.a at 108.162.209.120:6789/0
>>     -3> 2013-01-10 00:00:20.183797 7f2dd6680910  1 --
>> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
>> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
>> v22) v1 -- ?+0 0x5f75600 con 0x34629a0
>>     -2> 2013-01-10 00:00:20.184315 7f2dd5e7f910 10 monclient:
>> _send_mon_message to mon.a at 108.162.209.120:6789/0
>>     -1> 2013-01-10 00:00:20.184338 7f2dd5e7f910  1 --
>> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
>> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
>> v22) v1 -- ?+0 0x5b15400 con 0x34629a0
>>      0> 2013-01-10 00:00:20.184755 7f2ddde8f910 -1 os/FileStore.cc: In
>> function 'unsigned int
>> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)'
>> thread 7f2ddde8f910 time 2013-01-10 00:00:20.182422
>> os/FileStore.cc: 2681: FAILED assert(0 == "unexpected error")
>>
>> ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>> 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
>> long, int)+0x90a) [0x73e14a]
>> 2: (FileStore::do_transactions(std::list> std::allocator >&, unsigned long)+0x4c)
>> [0x7455dc]
>> 3: (FileStore::_do_op(FileStore::OpSequencer*)+0xab) [0x72428b]
>> 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x894feb]
>> 5: (ThreadPool::WorkThread::entry()+0x10) [0x8977d0]
>> 6: /lib/libpthread.so.0 [0x7f2de6d087aa]
>> 7: (clone()+0x6d) [0x7f2de518159d]
>> NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>> --- logging levels ---
>>    0/ 5 none
>>    0/ 1 lockdep
>>    0/ 1 context
>>    1/ 1 crush
>>    1/ 5 mds
Re: ceph caps (Ganesha + Ceph pnfs)
On Tue, 8 Jan 2013, Matt W. Benjamin wrote:
> Hi Sage,
>
> ----- "Sage Weil" wrote:
> > Your previous question made it sound like the DS was interacting with
> > libcephfs and dealing with (some) MDS capabilities. Is that right?
> >
> > I wonder if a much simpler approach would be to make a different fh
> > format or type, and just cram the inode and ceph object/block number
> > in there. Then the DS can just go direct to rados and avoid
> > interacting with the fs at all. There are some additional semantics
> > surrounding the truncate metadata, but if we're lucky that can fit
> > inside the fh, and the DS servers could really just act like object
> > targets--no libcephfs or MDS interaction at all.
>
> The current architecture gets the inode and block information to the DS
> reliably already without change to the Ceph fh--decoding steering
> information happens at the MDS, rather than the DS. It is important to
> us to ensure that the total steering information be "finite and
> manageable," though, since we need it to travel with the pNFS layout to
> the NFS client.

As a practical matter, that means your DS is actually doing an
open/lookup on the fh? My general concern is that that'll kill
performance...

> It is definitely the goal for the DS to go direct to rados. I think the
> outstanding issue may be limited to getting the MDS view of metadata
> up-to-date after an extending or truncating i/o completes (at least in
> the immediate term).

...but now I see the issue with committing the layout on the DS vs the
MDS.

> You may well be thinking, "sheesh, the client is doing out-of-band i/o,
> why doesn't it send the LAYOUTCOMMIT operation to the MDS to update the
> metadata." The unsatisfactory answer is that currently (due to our use
> of the "files" layout type) clients can insist that the DS do the
> commit. The Linux kernel client does so for writes below a size
> threshold.
>
> For the longer term, an option is shaping up that would allow us to use
> the objects layout (RFC 5664), which always commits layouts.

Meaning, the client always commits the layout via the MDS after writing
data to the objects?

> This discussion seems to be adding to the argument in support of
> switching, frankly. My intuition is that it's preferable to let the DS
> jump layers to commit, though, even if we want to elide such commits in
> future (not just for expediency, but because the flexibility to do it
> seems like a win for the Ceph architecture).

Maybe.. but if the DSs don't have open sessions with the MDS, they'd
have to open them. Even if they did, they'd need to get caps on the
inode before they could flush new size/mtime metadata. Unless we add a
new operation that behaves similar to how we normally do cap flushes:
i.e., make the size at least X and the mtime at least Y.

For small files, that seems like a win. For large files, you don't want
to send a request like that to the MDS for every object/block if you can
do it once from the pnfs client -> mds. Am I understanding correctly
that doing a single commit from the client (with the final file size) is
what the object layout allows?

sage

> > Either way, to your first (original question), yes, we should expose a
> > way via libcephfs to take a reference on the capability that isn't
> > released until the layout is committed. That should be pretty
> > straightforward to do, I think.
>
> Excellent.
>
> > Hopefully my understanding is getting closer!
> >
> > :)
> > sage
>
> Indeed, thanks
>
> --
> Matt Benjamin
> The Linux Box
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI 48104
>
> http://linuxbox.com
>
> tel. 734-761-4689
> fax. 734-769-8938
> cel. 734-216-5309
Re: OSD crash, ceph version 0.56.1
On Wed, 9 Jan 2013, Ian Pye wrote:
> Hi,
>
> Every time I try to bring up an OSD, it crashes and I get the
> following: "error (121) Remote I/O error not handled on operation 20"

This error code (EREMOTEIO) is not used by Ceph. What fs are you using?
Which kernel version? Anything else unusual happen with your hardware
recently that might have wreaked havoc on your underlying fs?

sage

> The cluster is new and only has a little bit of data on it. Any ideas
> what is going on? Does Remote I/O mean a network error? Full log
> below:
>
>     -9> 2013-01-10 00:00:20.182237 7f2ddde8f910  0
> filestore(/mnt/dist_j/ceph) error (121) Remote I/O error not handled
> on operation 20 (12.0.0, or op 0, counting from 0)
>     -8> 2013-01-10 00:00:20.182275 7f2ddde8f910  0
> filestore(/mnt/dist_j/ceph) unexpected error code
>     -7> 2013-01-10 00:00:20.182285 7f2ddde8f910  0
> filestore(/mnt/dist_j/ceph) transaction dump:
> { "ops": [
>       { "op_num": 0,
>         "op_name": "mkcoll",
>         "collection": "0.2c0_head"},
>       { "op_num": 1,
>         "op_name": "collection_setattr",
>         "collection": "0.2c0_head",
>         "name": "info",
>         "length": 5},
>       { "op_num": 2,
>         "op_name": "truncate",
>         "collection": "meta",
>         "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
>         "offset": 0},
>       { "op_num": 3,
>         "op_name": "write",
>         "collection": "meta",
>         "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
>         "length": 531,
>         "offset": 0,
>         "bufferlist length": 531},
>       { "op_num": 4,
>         "op_name": "remove",
>         "collection": "meta",
>         "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1"},
>       { "op_num": 5,
>         "op_name": "write",
>         "collection": "meta",
>         "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1",
>         "length": 0,
>         "offset": 0,
>         "bufferlist length": 0},
>       { "op_num": 6,
>         "op_name": "collection_setattr",
>         "collection": "0.2c0_head",
>         "name": "ondisklog",
>         "length": 34},
>       { "op_num": 7,
>         "op_name": "nop"}]}
>     -6> 2013-01-10 00:00:20.183085 7f2dd5e7f910 10 monclient:
> _send_mon_message to mon.a at 108.162.209.120:6789/0
>     -5> 2013-01-10 00:00:20.183108 7f2dd5e7f910  1 --
> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
> v22) v1 -- ?+0 0x5b15600 con 0x34629a0
>     -4> 2013-01-10 00:00:20.183772 7f2dd6680910 10 monclient:
> _send_mon_message to mon.a at 108.162.209.120:6789/0
>     -3> 2013-01-10 00:00:20.183797 7f2dd6680910  1 --
> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
> v22) v1 -- ?+0 0x5f75600 con 0x34629a0
>     -2> 2013-01-10 00:00:20.184315 7f2dd5e7f910 10 monclient:
> _send_mon_message to mon.a at 108.162.209.120:6789/0
>     -1> 2013-01-10 00:00:20.184338 7f2dd5e7f910  1 --
> 108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
> {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
> v22) v1 -- ?+0 0x5b15400 con 0x34629a0
>      0> 2013-01-10 00:00:20.184755 7f2ddde8f910 -1 os/FileStore.cc: In
> function 'unsigned int
> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)'
> thread 7f2ddde8f910 time 2013-01-10 00:00:20.182422
> os/FileStore.cc: 2681: FAILED assert(0 == "unexpected error")
>
> ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
> 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
> long, int)+0x90a) [0x73e14a]
> 2: (FileStore::do_transactions(std::list std::allocator >&, unsigned long)+0x4c)
> [0x7455dc]
> 3: (FileStore::_do_op(FileStore::OpSequencer*)+0xab) [0x72428b]
> 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x894feb]
> 5: (ThreadPool::WorkThread::entry()+0x10) [0x8977d0]
> 6: /lib/libpthread.so.0 [0x7f2de6d087aa]
> 7: (clone()+0x6d) [0x7f2de518159d]
> NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> --- logging levels ---
>    0/ 5 none
>    0/ 1 lockdep
>    0/ 1 context
>    1/ 1 crush
>    1/ 5 mds
>    1/ 5 mds_balancer
>    1/ 5 mds_locker
>    1/ 5 mds_log
>    1/ 5 mds_log_expire
>    1/ 5 mds_migrator
>    0/ 1 buffer
>    0/ 1 timer
>    0/ 1 filer
>    0/ 1 striper
>    0/ 1 objecter
>    0/ 5 rados
>    0/ 5 rbd
>    0/ 5 journaler
>    0/ 5 objectcacher
>    0/ 5 client
>    0/ 5 osd
>    0/ 5 optracker
>    0/ 5 objclass
>    1/ 3 filestore
>    1/ 3 journal
>    0/ 5 ms
>    1/ 5 mon
>    0/10 monc
>    0/ 5 paxos
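Sage's point that errno 121 comes from below Ceph is easy to verify; on Linux, errno 121 is EREMOTEIO ("Remote I/O error"), returned by the kernel/driver layer rather than by Ceph itself (a small illustration of my own, not from the thread):

```python
import errno
import os

# On Linux, errno 121 maps to EREMOTEIO; FileStore treats it as an
# "unexpected error code" precisely because Ceph never generates it.
# Note: errno numbering is platform-specific, so this check assumes Linux.
print(errno.errorcode[121])   # EREMOTEIO
print(os.strerror(121))
```

This is consistent with the later finding in the thread that the disk itself was failing (`mkfs.xfs: read failed: Input/output error`).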
OSD crash, ceph version 0.56.1
Hi,

Every time I try to bring up an OSD, it crashes and I get the
following: "error (121) Remote I/O error not handled on operation 20"

The cluster is new and only has a little bit of data on it. Any ideas
what is going on? Does Remote I/O mean a network error? Full log
below:

    -9> 2013-01-10 00:00:20.182237 7f2ddde8f910  0
filestore(/mnt/dist_j/ceph) error (121) Remote I/O error not handled
on operation 20 (12.0.0, or op 0, counting from 0)
    -8> 2013-01-10 00:00:20.182275 7f2ddde8f910  0
filestore(/mnt/dist_j/ceph) unexpected error code
    -7> 2013-01-10 00:00:20.182285 7f2ddde8f910  0
filestore(/mnt/dist_j/ceph) transaction dump:
{ "ops": [
      { "op_num": 0,
        "op_name": "mkcoll",
        "collection": "0.2c0_head"},
      { "op_num": 1,
        "op_name": "collection_setattr",
        "collection": "0.2c0_head",
        "name": "info",
        "length": 5},
      { "op_num": 2,
        "op_name": "truncate",
        "collection": "meta",
        "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
        "offset": 0},
      { "op_num": 3,
        "op_name": "write",
        "collection": "meta",
        "oid": "a04c46e9\/pginfo_0.2c0\/0\/\/-1",
        "length": 531,
        "offset": 0,
        "bufferlist length": 531},
      { "op_num": 4,
        "op_name": "remove",
        "collection": "meta",
        "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1"},
      { "op_num": 5,
        "op_name": "write",
        "collection": "meta",
        "oid": "1f9ede85\/pglog_0.2c0\/0\/\/-1",
        "length": 0,
        "offset": 0,
        "bufferlist length": 0},
      { "op_num": 6,
        "op_name": "collection_setattr",
        "collection": "0.2c0_head",
        "name": "ondisklog",
        "length": 34},
      { "op_num": 7,
        "op_name": "nop"}]}
    -6> 2013-01-10 00:00:20.183085 7f2dd5e7f910 10 monclient:
_send_mon_message to mon.a at 108.162.209.120:6789/0
    -5> 2013-01-10 00:00:20.183108 7f2dd5e7f910  1 --
108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
{0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
v22) v1 -- ?+0 0x5b15600 con 0x34629a0
    -4> 2013-01-10 00:00:20.183772 7f2dd6680910 10 monclient:
_send_mon_message to mon.a at 108.162.209.120:6789/0
    -3> 2013-01-10 00:00:20.183797 7f2dd6680910  1 --
108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
{0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
v22) v1 -- ?+0 0x5f75600 con 0x34629a0
    -2> 2013-01-10 00:00:20.184315 7f2dd5e7f910 10 monclient:
_send_mon_message to mon.a at 108.162.209.120:6789/0
    -1> 2013-01-10 00:00:20.184338 7f2dd5e7f910  1 --
108.162.209.120:6834/6359 --> 108.162.209.120:6789/0 -- osd_pgtemp(e22
{0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
v22) v1 -- ?+0 0x5b15400 con 0x34629a0
     0> 2013-01-10 00:00:20.184755 7f2ddde8f910 -1 os/FileStore.cc: In
function 'unsigned int
FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)'
thread 7f2ddde8f910 time 2013-01-10 00:00:20.182422
os/FileStore.cc: 2681: FAILED assert(0 == "unexpected error")

ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
long, int)+0x90a) [0x73e14a]
2: (FileStore::do_transactions(std::list >&, unsigned long)+0x4c)
[0x7455dc]
3: (FileStore::_do_op(FileStore::OpSequencer*)+0xab) [0x72428b]
4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x894feb]
5: (ThreadPool::WorkThread::entry()+0x10) [0x8977d0]
6: /lib/libpthread.so.0 [0x7f2de6d087aa]
7: (clone()+0x6d) [0x7f2de518159d]
NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent 10
   max_new 1000
   log_file /var/log/ceph/ceph-osd.9.log
--- end dump of recent events ---
2013-01-10 00:00:20.227763 7f2ddde8f910 -1 *** Caught signal (Aborted) **
in thread 7f2ddde8f910

ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
Re: OSD memory leaks?
Thank you. I appreciate it!

Dave Spano
Optogenics
Systems Administrator

----- Original Message -----

From: "Sébastien Han"
To: "Dave Spano"
Cc: "ceph-devel" , "Samuel Just"
Sent: Wednesday, January 9, 2013 5:12:12 PM
Subject: Re: OSD memory leaks?

Dave,

I'll share my little script with you for now, if you want it:

#!/bin/bash
for i in $(ps aux | grep [c]eph-osd | awk '{print $4}')
do
    MEM_INTEGER=$(echo $i | cut -d '.' -f1)
    OSD=$(ps aux | grep [c]eph-osd | grep "$i " | awk '{print $13}')
    if [[ $MEM_INTEGER -ge 25 ]]; then
        service ceph restart osd.$OSD >> /dev/null
        if [ $? -eq 0 ]; then
            logger -t ceph-memory-usage "The OSD number $OSD has been restarted since it was using $i % of the memory"
        else
            logger -t ceph-memory-usage "ERROR while restarting the OSD daemon"
        fi
    else
        logger -t ceph-memory-usage "The OSD number $OSD is only using $i % of the memory, doing nothing"
    fi
    logger -t ceph-memory-usage "Waiting 60 seconds before testing the next OSD..."
    sleep 60
done
logger -t ceph-memory-usage "Ceph state after memory check operation is: $(ceph health)"

Crons run with a 10 min interval every day on each storage node ;-).
Waiting for some Inktank guys now :-).

--
Regards,
Sébastien Han.

On Wed, Jan 9, 2013 at 10:42 PM, Dave Spano wrote:
> That's very good to know. I'll be restarting ceph-osd right now! Thanks for
> the heads up!
>
> Dave Spano
> Optogenics
> Systems Administrator
>
> ----- Original Message -----
>
> From: "Sébastien Han"
> To: "Dave Spano"
> Cc: "ceph-devel" , "Samuel Just"
> Sent: Wednesday, January 9, 2013 11:35:13 AM
> Subject: Re: OSD memory leaks?
>
> If you wait too long, the system will trigger the OOM killer :D, I already
> experienced that unfortunately...
>
> Sam?
>
> On Wed, Jan 9, 2013 at 5:10 PM, Dave Spano wrote:
>> OOM killer
>
> --
> Regards,
> Sébastien Han.
Re: OSD memory leaks?
Dave,

I'll share my little script with you for now, if you want it:

#!/bin/bash
for i in $(ps aux | grep [c]eph-osd | awk '{print $4}')
do
    MEM_INTEGER=$(echo $i | cut -d '.' -f1)
    OSD=$(ps aux | grep [c]eph-osd | grep "$i " | awk '{print $13}')
    if [[ $MEM_INTEGER -ge 25 ]]; then
        service ceph restart osd.$OSD >> /dev/null
        if [ $? -eq 0 ]; then
            logger -t ceph-memory-usage "The OSD number $OSD has been restarted since it was using $i % of the memory"
        else
            logger -t ceph-memory-usage "ERROR while restarting the OSD daemon"
        fi
    else
        logger -t ceph-memory-usage "The OSD number $OSD is only using $i % of the memory, doing nothing"
    fi
    logger -t ceph-memory-usage "Waiting 60 seconds before testing the next OSD..."
    sleep 60
done
logger -t ceph-memory-usage "Ceph state after memory check operation is: $(ceph health)"

Crons run with a 10 min interval every day on each storage node ;-).
Waiting for some Inktank guys now :-).

--
Regards,
Sébastien Han.

On Wed, Jan 9, 2013 at 10:42 PM, Dave Spano wrote:
> That's very good to know. I'll be restarting ceph-osd right now! Thanks for
> the heads up!
>
> Dave Spano
> Optogenics
> Systems Administrator
>
> ----- Original Message -----
>
> From: "Sébastien Han"
> To: "Dave Spano"
> Cc: "ceph-devel" , "Samuel Just"
> Sent: Wednesday, January 9, 2013 11:35:13 AM
> Subject: Re: OSD memory leaks?
>
> If you wait too long, the system will trigger the OOM killer :D, I already
> experienced that unfortunately...
>
> Sam?
>
> On Wed, Jan 9, 2013 at 5:10 PM, Dave Spano wrote:
>> OOM killer
>
> --
> Regards,
> Sébastien Han.
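For reference, the threshold check at the heart of the script above can be mirrored in a few lines of Python (my own sketch, not from the thread; it assumes standard `ps aux` column order with %MEM in the fourth field, and that the OSD id is the daemon's last command-line argument, as in the bash version):

```python
def osds_over_threshold(ps_lines, threshold=25.0):
    """Given lines of `ps aux` output, return the OSD ids of ceph-osd
    processes whose %MEM (4th column) meets or exceeds `threshold`,
    mirroring the bash script's restart condition."""
    hits = []
    for line in ps_lines:
        fields = line.split()
        if "ceph-osd" in line and float(fields[3]) >= threshold:
            hits.append(fields[-1])  # assumed: OSD id is the last argument
    return hits

# Hypothetical ps aux lines for illustration only
sample = [
    "root 6359 12.0 31.4 912345 654321 ? Ssl 00:00 12:34 /usr/bin/ceph-osd -i 9",
    "root 6412  3.1  8.2 512345 154321 ? Ssl 00:00  2:04 /usr/bin/ceph-osd -i 3",
]
print(osds_over_threshold(sample))   # ['9'] - only osd.9 exceeds 25%
```

Separating the detection from the `service ceph restart` side effect like this makes the threshold logic easy to test before wiring it into a cron job.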
Re: [PATCH] configure.ac: check for org.junit.rules.ExternalResource
I haven't tested this yet, but I like it. I think several of these
macros can be used to simplify a bit more of the Java config bit. I
also just saw the ax_jni_include_dir macro in the autoconf archive and
it looks like that can help clean up too.

On Wed, Jan 9, 2013 at 1:35 PM, Danny Al-Gaaf wrote:
> The attached patch depends on the set of 6 patches I sent some days ago.
> See: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/11793
>
> Danny Al-Gaaf (1):
>   configure.ac: check for org.junit.rules.ExternalResource
>
>  autogen.sh                |   2 +-
>  configure.ac              |  29 ++---
>  m4/ac_check_class.m4      | 108 ++
>  m4/ac_check_classpath.m4  |  24 +++
>  m4/ac_check_rqrd_class.m4 |  26 +++
>  m4/ac_java_options.m4     |  33 ++
>  m4/ac_prog_jar.m4         |  39 +
>  m4/ac_prog_java.m4        |  83 +++
>  m4/ac_prog_java_works.m4  |  98 +
>  m4/ac_prog_javac.m4       |  45 +++
>  m4/ac_prog_javac_works.m4 |  36 
>  m4/ac_prog_javah.m4       |  28 
>  m4/ac_try_compile_java.m4 |  40 +
>  m4/ac_try_run_javac.m4    |  41 ++
>  14 files changed, 615 insertions(+), 17 deletions(-)
>  create mode 100644 m4/ac_check_class.m4
>  create mode 100644 m4/ac_check_classpath.m4
>  create mode 100644 m4/ac_check_rqrd_class.m4
>  create mode 100644 m4/ac_java_options.m4
>  create mode 100644 m4/ac_prog_jar.m4
>  create mode 100644 m4/ac_prog_java.m4
>  create mode 100644 m4/ac_prog_java_works.m4
>  create mode 100644 m4/ac_prog_javac.m4
>  create mode 100644 m4/ac_prog_javac_works.m4
>  create mode 100644 m4/ac_prog_javah.m4
>  create mode 100644 m4/ac_try_compile_java.m4
>  create mode 100644 m4/ac_try_run_javac.m4
>
> --
> 1.8.1
Re: OSD memory leaks?
That's very good to know. I'll be restarting ceph-osd right now! Thanks for the heads up!

Dave Spano
Optogenics
Systems Administrator

----- Original Message -----

From: "Sébastien Han"
To: "Dave Spano"
Cc: "ceph-devel" , "Samuel Just"
Sent: Wednesday, January 9, 2013 11:35:13 AM
Subject: Re: OSD memory leaks?

If you wait too long, the system will trigger the OOM killer :D, I already
experienced that unfortunately...

Sam?

On Wed, Jan 9, 2013 at 5:10 PM, Dave Spano wrote:
> OOM killer

--
Regards,
Sébastien Han.
[PATCH] configure.ac: check for org.junit.rules.ExternalResource
Check for org.junit.rules.ExternalResource if built with
--enable-cephfs-java and --with-debug. Checking for junit4 isn't
enough, since junit4 gained this class only in 4.7.

Added some m4 files to provide the Java-related macros. Changed
autogen.sh to work with local m4 files/macros.

Signed-off-by: Danny Al-Gaaf
---
 autogen.sh                |   2 +-
 configure.ac              |  29 ++---
 m4/ac_check_class.m4      | 108 ++
 m4/ac_check_classpath.m4  |  24 +++
 m4/ac_check_rqrd_class.m4 |  26 +++
 m4/ac_java_options.m4     |  33 ++
 m4/ac_prog_jar.m4         |  39 +
 m4/ac_prog_java.m4        |  83 +++
 m4/ac_prog_java_works.m4  |  98 +
 m4/ac_prog_javac.m4       |  45 +++
 m4/ac_prog_javac_works.m4 |  36 
 m4/ac_prog_javah.m4       |  28 
 m4/ac_try_compile_java.m4 |  40 +
 m4/ac_try_run_javac.m4    |  41 ++
 14 files changed, 615 insertions(+), 17 deletions(-)
 create mode 100644 m4/ac_check_class.m4
 create mode 100644 m4/ac_check_classpath.m4
 create mode 100644 m4/ac_check_rqrd_class.m4
 create mode 100644 m4/ac_java_options.m4
 create mode 100644 m4/ac_prog_jar.m4
 create mode 100644 m4/ac_prog_java.m4
 create mode 100644 m4/ac_prog_java_works.m4
 create mode 100644 m4/ac_prog_javac.m4
 create mode 100644 m4/ac_prog_javac_works.m4
 create mode 100644 m4/ac_prog_javah.m4
 create mode 100644 m4/ac_try_compile_java.m4
 create mode 100644 m4/ac_try_run_javac.m4

diff --git a/autogen.sh b/autogen.sh
index 08e435b..9d6a77b 100755
--- a/autogen.sh
+++ b/autogen.sh
@@ -12,7 +12,7 @@ check_for_pkg_config() {
 }

 rm -f config.cache
-aclocal #-I m4
+aclocal -I m4 --install
 check_for_pkg_config
 libtoolize --force --copy
 autoconf

diff --git a/configure.ac b/configure.ac
index 832054b..32814b8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -271,9 +271,6 @@ AM_CONDITIONAL(ENABLE_CEPHFS_JAVA, test "x$enable_cephfs_java" = "xyes")
 AC_ARG_WITH(jdk-dir,
             AC_HELP_STRING([--with-jdk-dir(=DIR)], [Path to JDK directory]))

-AC_DEFUN([JAVA_DNE],
-  AC_MSG_ERROR([Cannot find $1 '$2'. Try setting --with-jdk-dir]))
-
 AS_IF([test "x$enable_cephfs_java" = "xyes"], [

   # setup bin/include dirs from --with-jdk-dir (search for jni.h, javac)
@@ -314,20 +311,20 @@ AS_IF([test "x$enable_cephfs_java" = "xyes"], [
         AC_MSG_NOTICE([Cannot find junit4.jar (apt-get install junit4)])
         [have_junit4=0]])])

-  # Check for Java programs: javac, javah, jar
-  PATH_save=$PATH
-  PATH="$PATH:$EXTRA_JDK_BIN_DIR"
-  AC_PATH_PROG(JAVAC, javac)
-  AC_PATH_PROG(JAVAH, javah)
-  AC_PATH_PROG(JAR, jar)
-  PATH=$PATH_save
+  AC_CHECK_CLASSPATH
+  AC_PROG_JAVAC
+  AC_PROG_JAVAH
+  AC_PROG_JAR

-  # Ensure we have them...
-  AS_IF([test -z "$JAVAC"], JAVA_DNE(program, javac))
-  AS_IF([test -z "$JAVAH"], JAVA_DNE(program, javah))
-  AS_IF([test -z "$JAR"], JAVA_DNE(program, jar))
+  CLASSPATH=$CLASSPATH:$EXTRA_CLASSPATH_JAR
+  export CLASSPATH
+  AC_MSG_NOTICE([classpath - $CLASSPATH])
+  AS_IF([test "$have_junit4" = "1"], [
+    AC_CHECK_CLASS([org.junit.rules.ExternalResource], [], [
+      AC_MSG_NOTICE(Could not find org.junit.rules.ExternalResource)
+      have_junit4=0])])

-# Check for jni.h
+# Check for jni.h
   CPPFLAGS_save=$CPPFLAGS
   AS_IF([test -n "$EXTRA_JDK_INC_DIR"],
@@ -336,7 +333,7 @@ AS_IF([test "x$enable_cephfs_java" = "xyes"], [
          [JDK_CPPFLAGS="$JDK_CPPFLAGS -I$EXTRA_JDK_INC_DIR/linux"])
         CPPFLAGS="$CPPFLAGS $JDK_CPPFLAGS"])

-  AC_CHECK_HEADER([jni.h], [], JAVA_DNE(header, jni.h))
+  AC_CHECK_HEADER([jni.h], [], AC_MSG_ERROR([Cannot find header 'jni.h'. Try setting --with-jdk-dir]))
   CPPFLAGS=$CPPFLAGS_save

diff --git a/m4/ac_check_class.m4 b/m4/ac_check_class.m4
new file mode 100644
index 000..17932c5
--- /dev/null
+++ b/m4/ac_check_class.m4
@@ -0,0 +1,108 @@
+dnl @synopsis AC_CHECK_CLASS
+dnl
+dnl AC_CHECK_CLASS tests the existence of a given Java class, either in
+dnl a jar or in a '.class' file.
+dnl
+dnl *Warning*: its success or failure can depend on a proper setting of
+dnl the CLASSPATH env. variable.
+dnl
+dnl Note: This is part of the set of autoconf M4 macros for Java
+dnl programs. It is VERY IMPORTANT that you download the whole set,
+dnl some macros depend on other. Unfortunately, the autoconf archive
+dnl does not support the concept of set of macros, so I had to break it
+dnl for submission. The general documentation, as well as the sample
+dnl configure.in, is included in the AC_PROG_JAVA macro.
+dnl
+dnl @category Java
+dnl @author Stephane Bortzmeyer
+dnl @version 2000-07-19
+dnl @license GPLWithACException
+
+AC_DEFUN([AC_CHECK_CLASS],[
+AC_REQUIRE([AC_PR
[PATCH] configure.ac: check for org.junit.rules.ExternalResource
The attached patch depends on the set of 6 patches I send some days ago. See: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/11793 Danny Al-Gaaf (1): configure.ac: check for org.junit.rules.ExternalResource autogen.sh| 2 +- configure.ac | 29 ++--- m4/ac_check_class.m4 | 108 ++ m4/ac_check_classpath.m4 | 24 +++ m4/ac_check_rqrd_class.m4 | 26 +++ m4/ac_java_options.m4 | 33 ++ m4/ac_prog_jar.m4 | 39 + m4/ac_prog_java.m4| 83 +++ m4/ac_prog_java_works.m4 | 98 + m4/ac_prog_javac.m4 | 45 +++ m4/ac_prog_javac_works.m4 | 36 m4/ac_prog_javah.m4 | 28 m4/ac_try_compile_java.m4 | 40 + m4/ac_try_run_javac.m4| 41 ++ 14 files changed, 615 insertions(+), 17 deletions(-) create mode 100644 m4/ac_check_class.m4 create mode 100644 m4/ac_check_classpath.m4 create mode 100644 m4/ac_check_rqrd_class.m4 create mode 100644 m4/ac_java_options.m4 create mode 100644 m4/ac_prog_jar.m4 create mode 100644 m4/ac_prog_java.m4 create mode 100644 m4/ac_prog_java_works.m4 create mode 100644 m4/ac_prog_javac.m4 create mode 100644 m4/ac_prog_javac_works.m4 create mode 100644 m4/ac_prog_javah.m4 create mode 100644 m4/ac_try_compile_java.m4 create mode 100644 m4/ac_try_run_javac.m4 -- 1.8.1 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: geo replication
Right now, your only option is synchronous replication, which happens at the speed of the slowest OSD ... so unless your WAN links are fast and fat, it comes at a non-negligible performance penalty. We will soon be sending out a proposal for an asynchronous replication mechanism with eventual consistency for the RADOS Gateway ... but that is a somewhat simpler problem (immutable objects, good change lists, and a WAN-friendly protocol). Asynchronous RADOS replication is definitely on our list, but more complex and farther out. On 01/09/2013 01:19 PM, Gandalf Corvotempesta wrote: probably this was already asked before but I'm unable to find any answer. Is it possible to replicate a cluster geographically? GlusterFS does this with rsync (I think called automatically on every file write), does Ceph do something similar? I don't think that using multiple geographically distributed OSDs with 10-15 ms of latency will work well -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] osd/ReplicatedPG.cc: fix errors in _scrub()
Fix build error introduced with 5b12b514b047a8a46cc5549bd94b398289b9b5f6: osd/ReplicatedPG.cc: In member function 'virtual void ReplicatedPG::_scrub(ScrubMap&)': osd/ReplicatedPG.cc:7116:4: error: 'errors' was not declared in this scope Increment scrubber.errors instead of errors. Signed-off-by: Danny Al-Gaaf --- src/osd/ReplicatedPG.cc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/osd/ReplicatedPG.cc b/src/osd/ReplicatedPG.cc index e8a68fe..1645041 100644 --- a/src/osd/ReplicatedPG.cc +++ b/src/osd/ReplicatedPG.cc @@ -7113,7 +7113,7 @@ void ReplicatedPG::_scrub(ScrubMap& scrubmap) if (head == hobject_t()) { osd->clog.error() << mode << " " << info.pgid << " " << soid << " found clone without head"; - ++errors; + ++scrubber.errors; continue; } -- 1.8.1 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Windows port
Hi, Along the same lines, (p)NFS access from Windows clients should already be possible, for some definition of possible. We'll make it actually possible over the next few months. Matt - "Sage Weil" wrote: > On Wed, 9 Jan 2013, Florian Haas wrote: > > On Tue, Jan 8, 2013 at 3:00 PM, Dino Yancey > wrote: > > > Hi, > > > > > > I am also curious if a Windows port, specifically the client-side, > is > > > on the roadmap. > > > > This is somewhat OT from the original post, but if all you're > > interested is using RBD block storage from Windows, you can already > do > > that by going through an iSCSI or FC head node. Proof-of-concept > > configuration outlined here: > > > > > http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices > > > > Not sure if this helps, but just thought I'd mention it. > > There is also a patch for Samba that glues libcephfs into Samba's VFS > > layer. This will let you reexport CephFS via CIFS. These patches are > > currently living at > > https://github.com/ceph/samba/commits/ceph-v3-6-test > > If anybody is interested in playing with these, have at it! Inktank > doesn't have resources to focus on it right now. > > sage > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Matt Benjamin The Linux Box 206 South Fifth Ave. Suite 150 Ann Arbor, MI 48104 http://linuxbox.com tel. 734-761-4689 fax. 734-769-8938 cel. 734-216-5309 -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD memory leaks?
Hi, Thanks for the input. I also have tons of "socket closed" messages; I recall that this message is harmless. Anyway, Cephx has been disabled on my platform from the beginning... Can anyone confirm or refute my "scrub theory"? -- Regards, Sébastien Han. On Wed, Jan 9, 2013 at 7:09 PM, Sylvain Munaut wrote: > Just fyi, I also have growing memory on OSD, and I have the same logs: > > "libceph: osd4 172.20.11.32:6801 socket closed" in the RBD clients > > > I traced that problem and correlated it to some cephx issue in the OSD > some time ago in this thread > > http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg10634.html > > but the thread kind of died without a solution ... > > Cheers, > >Sylvain -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD memory leaks?
Just fyi, I also have growing memory on OSD, and I have the same logs: "libceph: osd4 172.20.11.32:6801 socket closed" in the RBD clients I traced that problem and correlated it to some cephx issue in the OSD some time ago in this thread http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg10634.html but the thread kind of died without a solution ... Cheers, Sylvain -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Windows port
On Wed, 9 Jan 2013, Florian Haas wrote: > On Tue, Jan 8, 2013 at 3:00 PM, Dino Yancey wrote: > > Hi, > > > > I am also curious if a Windows port, specifically the client-side, is > > on the roadmap. > > This is somewhat OT from the original post, but if all you're > interested is using RBD block storage from Windows, you can already do > that by going through an iSCSI or FC head node. Proof-of-concept > configuration outlined here: > > http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices > > Not sure if this helps, but just thought I'd mention it. There is also a patch for Samba that glues libcephfs into Samba's VFS layer. This will let you reexport CephFS via CIFS. These patches are currently living at https://github.com/ceph/samba/commits/ceph-v3-6-test If anybody is interested in playing with these, have at it! Inktank doesn't have resources to focus on it right now. sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?
On Wed, 9 Jan 2013, Dennis Jacobfeuerborn wrote: > On 01/09/2013 01:51 PM, Lachfeld, Jutta wrote: > > Hi all, > > > > in expectation of better performance, we are just switching from CEPH > > version 0.48 to 0.56.1 > > for comparisons between Hadoop with HDFS and Hadoop with CEPH FS. > > > > We are now wondering whether there are currently any development activities > > concerning further significant performance enhancements, > > or whether further significant performance enhancements are already planned > > for the near future. > > > > I would now be loath to start benchmarking with 0.56.1 and then, a month or > > so later, detect that there have been significant performance enhancements > > in CEPH in the meantime. > > There shouldn't be any major changes since v0.56.x is a stable release and > as such should only receive bug-/securityfixes and non-risky improvements. > Any changes that would result in a significant change in performance would > probably be too disruptive for a stable release series. That is generally true. One exception is that there may be some simple changes that can decrease the impact of data migration on performance. There are some changes we made for a customer that seem to make a big difference and will be making it into the main tree (and hopefully bobtail, and possibly even argonaut) shortly. sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD memory leaks?
If you wait too long, the system will trigger OOM killer :D, I already experienced that unfortunately... Sam? On Wed, Jan 9, 2013 at 5:10 PM, Dave Spano wrote: > OOM killer -- Regards, Sébastien Han. -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD memory leaks?
Yes, I'm using argonaut. I've got 38 heap files from yesterday. Currently, the OSD in question is using 91.2% of memory according to top, and staying there. I initially thought it would go until the OOM killer started killing processes, but I don't see anything funny in the system logs that indicate that. On the other hand, the ceph-osd process on osd.1 is using far less memory.

osd.0
  PID USER PR NI VIRT  RES SHR  S %CPU %MEM  TIME+     COMMAND
 9151 root 20  0 20.4g 14g 2548 S    1 91.2 517:58.71 ceph-osd

osd.1
   PID USER PR NI VIRT RES  SHR  S %CPU %MEM  TIME+     COMMAND
 10785 root 20  0 673m 310m 5164 S    3  1.9 107:04.39 ceph-osd

Here's what tcmalloc says when I run ceph osd tell 0 heap stats:

2013-01-09 11:09:36.778675 7f62aae23700 0 log [INF] : osd.0 tcmalloc heap stats:
2013-01-09 11:09:36.779113 7f62aae23700 0 log [INF] : MALLOC: 210884768 ( 201.1 MB) Bytes in use by application
2013-01-09 11:09:36.779348 7f62aae23700 0 log [INF] : MALLOC: + 89026560 ( 84.9 MB) Bytes in page heap freelist
2013-01-09 11:09:36.779928 7f62aae23700 0 log [INF] : MALLOC: + 7926512 ( 7.6 MB) Bytes in central cache freelist
2013-01-09 11:09:36.779951 7f62aae23700 0 log [INF] : MALLOC: + 144896 ( 0.1 MB) Bytes in transfer cache freelist
2013-01-09 11:09:36.779972 7f62aae23700 0 log [INF] : MALLOC: + 11046512 ( 10.5 MB) Bytes in thread cache freelists
2013-01-09 11:09:36.780013 7f62aae23700 0 log [INF] : MALLOC: + 5177344 ( 4.9 MB) Bytes in malloc metadata
2013-01-09 11:09:36.780030 7f62aae23700 0 log [INF] : MALLOC:
2013-01-09 11:09:36.780056 7f62aae23700 0 log [INF] : MALLOC: = 324206592 ( 309.2 MB) Actual memory used (physical + swap)
2013-01-09 11:09:36.780081 7f62aae23700 0 log [INF] : MALLOC: + 126177280 ( 120.3 MB) Bytes released to OS (aka unmapped)
2013-01-09 11:09:36.780112 7f62aae23700 0 log [INF] : MALLOC:
2013-01-09 11:09:36.780127 7f62aae23700 0 log [INF] : MALLOC: = 450383872 ( 429.5 MB) Virtual address space used
2013-01-09 11:09:36.780152 7f62aae23700 0 log [INF] : MALLOC:
2013-01-09 11:09:36.780168 7f62aae23700 0 log [INF] : MALLOC: 37492 Spans in use
2013-01-09 11:09:36.780330 7f62aae23700 0 log [INF] : MALLOC: 51 Thread heaps in use
2013-01-09 11:09:36.780359 7f62aae23700 0 log [INF] : MALLOC: 4096 Tcmalloc page size
2013-01-09 11:09:36.780384 7f62aae23700 0 log [INF] :

Dave Spano Optogenics Systems Administrator - Original Message - From: "Sébastien Han" To: "Samuel Just" Cc: "Dave Spano" , "ceph-devel" Sent: Wednesday, January 9, 2013 10:20:43 AM Subject: Re: OSD memory leaks? I guess he runs Argonaut as well. More suggestions about this problem? Thanks! -- Regards, Sébastien Han. On Mon, Jan 7, 2013 at 8:09 PM, Samuel Just wrote: > > Awesome! What version are you running (ceph-osd -v, include the hash)? > -Sam > > On Mon, Jan 7, 2013 at 11:03 AM, Dave Spano wrote: > > This failed the first time I sent it, so I'm resending in plain text. > > > > Dave Spano > > Optogenics > > Systems Administrator > > > > > > > > - Original Message - > > > > From: "Dave Spano" > > To: "Sébastien Han" > > Cc: "ceph-devel" , "Samuel Just" > > > > Sent: Monday, January 7, 2013 12:40:06 PM > > Subject: Re: OSD memory leaks? > > > > > > Sam, > > > > Attached are some heaps that I collected today. 001 and 003 are just after > > I started the profiler; 011 is the most recent. If you need more, or > > anything different let me know. Already the OSD in question is at 38% > > memory usage. As mentioned by Sèbastien, restarting ceph-osd keeps things > > going. > > > > Not sure if this is helpful information, but out of the two OSDs that I > > have running, the first one (osd.0) is the one that develops this problem > > the quickest. osd.1 does have the same issue, it just takes much longer. Do > > the monitors hit the first osd in the list first, when there's activity? 
> > > > > > Dave Spano > > Optogenics > > Systems Administrator > > > > > > - Original Message - > > > > From: "Sébastien Han" > > To: "Samuel Just" > > Cc: "ceph-devel" > > Sent: Friday, January 4, 2013 10:20:58 AM > > Subject: Re: OSD memory leaks? > > > > Hi Sam, > > > > Thanks for your answer and sorry the late reply. > > > > Unfortunately I can't get something out from the profiler, actually I > > do but I guess it doesn't show what is supposed to show... I will keep
Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
Hi Jutta, On Wed, Jan 9, 2013 at 7:11 AM, Lachfeld, Jutta wrote: > > the current content of the web page http://ceph.com/docs/master/cephfs/hadoop > shows a configuration parameter ceph.object.size. > Is it the CEPH equivalent to the "HDFS block size" parameter which I have > been looking for? Yes. By specifying ceph.object.size, Hadoop will use a default Ceph file layout with stripe unit = object size, and stripe count = 1. This is effectively the same meaning as dfs.block.size for HDFS. > Does the parameter ceph.object.size apply to version 0.56.1? The Ceph/Hadoop file system plugin is being developed here: git://github.com/ceph/hadoop-common cephfs/branch-1.0 There is an old version of the Hadoop plugin in the Ceph tree which will be removed shortly. Regarding the versions, development is taking place in cephfs/branch-1.0 and in ceph.git master. We don't yet have a system in place for dealing with compatibility across versions because the code is in heavy development. If you are running 0.56.1 then a recent version of cephfs/branch-1.0 should work with that, but may not for long, as development continues. > I would be interested in setting this parameter to values higher than 64MB, > e.g. 256MB or 512MB similar to the values I have used for HDFS for increasing > the performance of the TeraSort benchmark. Would these values be allowed and > would they at all make sense for the mechanisms used in CEPH? I can't think of any reason why a large size would cause concern, but maybe someone else can chime in? - Noah -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
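[Editor's note] To make the mapping above concrete: assuming the cephfs plugin reads its settings from the standard Hadoop configuration files (the property name ceph.object.size comes from the discussion above; the XML framing is ordinary Hadoop core-site.xml convention, and the exact key and units should be verified against the plugin version in use), a 256 MB object size would be sketched like this:

```xml
<!-- core-site.xml sketch (hedged): the property name ceph.object.size is
     taken from the thread above; verify key and units against your plugin. -->
<property>
  <name>ceph.object.size</name>
  <!-- 256 MB in bytes, mirroring how dfs.block.size is usually given -->
  <value>268435456</value>
</property>
```

As with dfs.block.size, such a setting would only affect files created afterwards; a Ceph file's layout cannot change once data has been written to it.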
Re: OSD memory leaks?
I guess he runs Argonaut as well. More suggestions about this problem? Thanks! -- Regards, Sébastien Han. On Mon, Jan 7, 2013 at 8:09 PM, Samuel Just wrote: > > Awesome! What version are you running (ceph-osd -v, include the hash)? > -Sam > > On Mon, Jan 7, 2013 at 11:03 AM, Dave Spano wrote: > > This failed the first time I sent it, so I'm resending in plain text. > > > > Dave Spano > > Optogenics > > Systems Administrator > > > > > > > > - Original Message - > > > > From: "Dave Spano" > > To: "Sébastien Han" > > Cc: "ceph-devel" , "Samuel Just" > > > > Sent: Monday, January 7, 2013 12:40:06 PM > > Subject: Re: OSD memory leaks? > > > > > > Sam, > > > > Attached are some heaps that I collected today. 001 and 003 are just after > > I started the profiler; 011 is the most recent. If you need more, or > > anything different let me know. Already the OSD in question is at 38% > > memory usage. As mentioned by Sèbastien, restarting ceph-osd keeps things > > going. > > > > Not sure if this is helpful information, but out of the two OSDs that I > > have running, the first one (osd.0) is the one that develops this problem > > the quickest. osd.1 does have the same issue, it just takes much longer. Do > > the monitors hit the first osd in the list first, when there's activity? > > > > > > Dave Spano > > Optogenics > > Systems Administrator > > > > > > - Original Message - > > > > From: "Sébastien Han" > > To: "Samuel Just" > > Cc: "ceph-devel" > > Sent: Friday, January 4, 2013 10:20:58 AM > > Subject: Re: OSD memory leaks? > > > > Hi Sam, > > > > Thanks for your answer and sorry the late reply. > > > > Unfortunately I can't get something out from the profiler, actually I > > do but I guess it doesn't show what is supposed to show... I will keep > > on trying this. Anyway yesterday I just thought that the problem might > > be due to some over usage of some OSDs. 
I was thinking that the > > distribution of the primary OSD might be uneven, this could have > > explained that some memory leaks are more important with some servers. > > At the end, the repartition seems even but while looking at the pg > > dump I found something interesting in the scrub column, timestamps > > from the last scrubbing operation matched with times showed on the > > graph. > > > > After this, I made some calculation, I compared the total number of > > scrubbing operation with the time range where memory leaks occurred. > > First of all check my setup: > > > > root@c2-ceph-01 ~ # ceph osd tree > > dumped osdmap tree epoch 859 > > # id weight type name up/down reweight > > -1 12 pool default > > -3 12 rack lc2_rack33 > > -2 3 host c2-ceph-01 > > 0 1 osd.0 up 1 > > 1 1 osd.1 up 1 > > 2 1 osd.2 up 1 > > -4 3 host c2-ceph-04 > > 10 1 osd.10 up 1 > > 11 1 osd.11 up 1 > > 9 1 osd.9 up 1 > > -5 3 host c2-ceph-02 > > 3 1 osd.3 up 1 > > 4 1 osd.4 up 1 > > 5 1 osd.5 up 1 > > -6 3 host c2-ceph-03 > > 6 1 osd.6 up 1 > > 7 1 osd.7 up 1 > > 8 1 osd.8 up 1 > > > > > > And there are the results: > > > > * Ceph node 1 which has the most important memory leak performed 1608 > > in total and 1059 during the time range where memory leaks occured > > * Ceph node 2, 1168 in total and 776 during the time range where > > memory leaks occured > > * Ceph node 3, 940 in total and 94 during the time range where memory > > leaks occurred > > * Ceph node 4, 899 in total and 191 during the time range where > > memory leaks occurred > > > > I'm still not entirely sure that the scrub operation causes the leak > > but the only relevant relation that I found... > > > > Could it be that the scrubbing process doesn't release memory? Btw I > > was wondering, how ceph decides at what time it should run the > > scrubbing operation? 
I know that it's once a day and control by the > > following options > > > > OPTION(osd_scrub_min_interval, OPT_FLOAT, 300) > > OPTION(osd_scrub_max_interval, OPT_FLOAT, 60*60*24) > > > > But how ceph determined the time where the operation started, during > > cluster creation probably? > > > > I just checked the options that control OSD scrubbing and found that by > > default: > > > > OPTION(osd_max_scrubs, OPT_INT, 1) > > > > So that might explain why only one OSD uses a lot of memory. > > > > My dirty workaround at the moment is to performed a check of memory > > use by every OSD and restart it if it uses more than 25% of the total > > memory. Also note that on ceph 1, 3 and 4 it's always one OSD that > > uses a lot of memory, for ceph 2 only the mem usage is high but almost > > the same for all the OSD process. > > > > Thank you in advance. > > > > -- > > Regards, > > Sébastien Han. > > > > > > On Wed, Dec 19, 2012 at 10:43 PM, Samuel Just wrote: > >> > >> Sorry, it's been very busy. The next step would to try to get a heap > >> dump. You can start a heap profile on osd N by: > >> > >> ceph osd tell N heap start_profiler > >> > >> and you can get it to dump the collected profile using > >> > >> ceph osd tell N h
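[Editor's note] Sébastien's "dirty workaround" above (restart any ceph-osd that exceeds 25% of total memory) boils down to a small decision rule. A minimal sketch of that logic in Python — the function names are mine, and the RSS/total-memory figures are assumed to be parsed from `ps` and /proc/meminfo by surrounding glue that is omitted here:

```python
def should_restart(rss_kb, total_kb, threshold=0.25):
    """True when an OSD's resident set exceeds the threshold fraction
    of total system memory (the 25% rule described in the thread)."""
    return rss_kb > threshold * total_kb

def osds_to_restart(osd_rss_kb, total_kb, threshold=0.25):
    """osd_rss_kb maps OSD id -> RSS in kB (e.g. parsed from
    `ps -C ceph-osd -o pid=,rss=`); returns the ids to bounce."""
    return sorted(osd_id for osd_id, rss in osd_rss_kb.items()
                  if should_restart(rss, total_kb, threshold))
```

A cron job could feed this from ps output and restart the returned daemons (e.g. via the distribution's init script); that wiring is deployment-specific and omitted.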
RE: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark performance comparison issue
Hi Noah, the current content of the web page http://ceph.com/docs/master/cephfs/hadoop shows a configuration parameter ceph.object.size. Is it the CEPH equivalent to the "HDFS block size" parameter which I have been looking for? Does the parameter ceph.object.size apply to version 0.56.1? I would be interested in setting this parameter to values higher than 64MB, e.g. 256MB or 512MB similar to the values I have used for HDFS for increasing the performance of the TeraSort benchmark. Would these values be allowed and would they at all make sense for the mechanisms used in CEPH? Regards, Jutta. - jutta.lachf...@ts.fujitsu.com, Fujitsu Technology Solutions PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. ..49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint > -Original Message- > From: Noah Watkins [mailto:jayh...@cs.ucsc.edu] > Sent: Thursday, December 13, 2012 9:33 PM > To: Gregory Farnum > Cc: Cameron Bahar; Sage Weil; Lachfeld, Jutta; ceph-devel@vger.kernel.org; > Noah > Watkins; Joe Buck > Subject: Re: Usage of CEPH FS versa HDFS for Hadoop: TeraSort benchmark > performance comparison issue > > The bindings use the default Hadoop settings (e.g. 64 or 128 MB > chunks) when creating new files. The chunk size can also be specified on a > per-file basis > using the same interface as Hadoop. Additionally, while Hadoop doesn't > provide an > interface to configuration parameters beyond chunk size, we will also let > users fully > configure for any Ceph striping strategy. > http://ceph.com/docs/master/dev/file-striping/ > > -Noah > > On Thu, Dec 13, 2012 at 12:27 PM, Gregory Farnum wrote: > > On Thu, Dec 13, 2012 at 12:23 PM, Cameron Bahar wrote: > >> Is the chunk size tunable in A Ceph cluster. I don't mean dynamic, but > >> even statically > configurable when a cluster is first installed? > > > > Yeah. You can set chunk size on a per-file basis; you just can't > > change it once the file has any data written to it. 
> > In the context of Hadoop the question is just if the bindings are > > configured correctly to do so automatically. > > -Greg > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > > in the body of a message to majord...@vger.kernel.org More majordomo > > info at http://vger.kernel.org/majordomo-info.html
Re: Crushmap Design Question
On 01/09/2013 08:59 AM, Wido den Hollander wrote: > Hi, > > On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote: >> Hi, >> Setting rep size to 3 only makes the data triple-replicated, which means >> when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible. >> But the Monitor is another story: for monitor clusters with 2N+1 nodes, it >> requires at least N+1 nodes alive, and indeed this is why your Ceph failed. >> It looks to me this discipline makes it hard to design a proper >> deployment which is robust against a DC outage. But I am hoping for inputs from >> the community: how to make the Monitor cluster reliable. >> > > From what I understand he didn't kill the second mon, still leaving 2 > out of 3 mons running. Indeed. A good hint that this is the case is this bit of Shawn's message: >> When I fail a datacenter (including 1 of 3 mon's) I eventually get: >> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 >> active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; >> 16362/49086 degraded (33.333%) >> >> At this point everything is still ok. But when I fail the 2nd datacenter >> (still leaving 2 out of 3 mons running) I get: >> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 >> incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail If you still manage to get these messages, it means your monitors are still handling and answering requests, and that only happens when you have a quorum :) -Joao -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
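[Editor's note] Joao's quorum point is plain majority arithmetic: a cluster of 2N+1 monitors needs at least N+1 alive. A tiny sketch (not a Ceph API, just the arithmetic):

```python
def quorum_size(num_mons):
    """Smallest majority of a monitor cluster: floor(n/2) + 1."""
    return num_mons // 2 + 1

def have_quorum(num_mons, num_alive):
    """True while enough monitors survive to form a majority."""
    return num_alive >= quorum_size(num_mons)
```

With 3 monitors, quorum_size(3) is 2, which matches the thread: the cluster kept answering requests while 2 of the 3 mons were still running.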
RE: Crushmap Design Question
Correct, it never went below N+1 (3 total mons and 2 of them never went down). Several times in the past I verified that a pg was actually mapped to valid dc's with that command. I just wrote a quick script that will do this on the fly and after recovering the cluster last night, every pg has an osd mapping respective to an osd in a dc. I will fail the cluster again later today and see what it looks like after 1 dc fails and then again after the 2nd fails. As far as the weighting goes, I'm not sure how I ended up this way. So should I change the "adm" tree: FROM -25 8 datacenter adm -16 8 host admdisk0 TO -25 36 datacenter adm -16 1 host admdisk0 Regards -Original Message- From: Wido den Hollander [mailto:w...@widodh.nl] Sent: Wednesday, January 09, 2013 4:00 AM To: Chen, Xiaoxi Cc: Moore, Shawn M; ceph-devel@vger.kernel.org Subject: Re: Crushmap Design Question Hi, On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote: > Hi, > Setting rep size to 3 only make the data triple-replication, that means > when you "fail" all OSDs in 2 out of 3 DCs, the data still accessable. > But Monitor is another story, for monitor clusters with 2N+1 nodes, it > require at least N+1 nodes alive, and indeed this is why you Ceph failed. > It looks to me this discipline make it hard to design a proper > deployment which is robust in DC outage. But hoping for inputs from > community,how to make Monitor cluster reliable. > From what I understand he didn't kill the second mon, still leaving 2 out of 3 mons running. Could you check if your PGs are actually mapped to OSDs spread out over the 3 DCs? "ceph pg dump" should tell you to which OSDs the PGs are mapped. I've never tried before, but you don't have equal weights for the datacenters, I don't know how that effects the situation. 
Wido > > > Xiaoxi > > > -Original Message- > From: ceph-devel-ow...@vger.kernel.org > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Moore, Shawn M > Sent: 2013年1月9日 4:21 > To: ceph-devel@vger.kernel.org > Subject: Crushmap Design Question > > I have been testing ceph for a little over a month now. Our design goal is > to have 3 datacenters in different buildings all tied together over 10GbE. > Currently there are 10 servers each serving 1 osd in 2 of the datacenters. > In the third is one large server with 16 SAS disks serving 8 osds. > Eventually we will add one more identical large server into the third > datacenter. I have told ceph to keep 3 copies and tried to do the crushmap > in such a way that as long as a majority of mon's can stay up, we could run > off of one datacenter's worth of osds. So in my testing, it doesn't work > out quite this way... > > Everything is currently ceph version 0.56.1 > (e4a541624df62ef353e754391cbbb707f54b16f7) > > I will put hopefully relevant files at the end of this email. > > When all 28 osds are up, I get: > 2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 > active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail > > When I fail a datacenter (including 1 of 3 mon's) I eventually get: > 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 > active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; > 16362/49086 degraded (33.333%) > > At this point everything is still ok. But when I fail the 2nd datacenter > (still leaving 2 out of 3 mons running) I get: > 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 > incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail > > Most VM's quit working and "rbd ls" works, but not a single line from "rados > -p rbd ls" works and the command hangs. 
Now after a while (you can see from > timestamps) I end up at and stays this way: > 2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, > 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 > remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; > 7696/49086 degraded (15.679%) > > I'm hoping I've done something wrong, so please advise. Below are my > configs. If you need something more to help, just ask. > > Normal output with all datacenters up. > # ceph osd tree > # id weight type name up/down reweight > -180 root default > -336 datacenter hok > -21 host blade151 > 0 1 osd.0 up 1 > -41 host blade152 > 1 1 osd.1 up 1 > -15 1 host blade153 > 2 1
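[Editor's note] For the goal stated at the top of this thread (3 replicas, one per datacenter, so a single surviving DC can still serve data), the usual tool is a CRUSH rule that spreads replicas across the datacenter bucket type. A hedged sketch in decompiled-crushmap syntax — the rule name and ruleset number are placeholders, not taken from Shawn's map:

```
rule replicated_across_dcs {
	ruleset 1
	type replicated
	min_size 1
	max_size 3
	step take default
	# pick one leaf (osd) under each of up to 3 distinct datacenter buckets
	step chooseleaf firstn 0 type datacenter
	step emit
}
```

Even with such a rule, PGs can go incomplete when surviving replicas drop below what the pool will accept, so comparing the acting sets from `ceph pg dump` against the datacenter buckets (as Shawn's script does) remains the way to confirm the placement is what you intended.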
Re: Windows port
On Tue, Jan 8, 2013 at 3:00 PM, Dino Yancey wrote:
> Hi,
>
> I am also curious if a Windows port, specifically the client-side, is
> on the roadmap.

This is somewhat OT from the original post, but if all you're interested in is using RBD block storage from Windows, you can already do that by going through an iSCSI or FC head node. A proof-of-concept configuration is outlined here:

http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices

Not sure if this helps, but just thought I'd mention it.

Cheers,
Florian
--
Helpful information? Let us know! http://www.hastexo.com/shoutbox
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?
On 01/09/2013 06:51 AM, Lachfeld, Jutta wrote:
> Hi all,
>
> in expectation of better performance, we are just switching from CEPH version 0.48 to 0.56.1 for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.
>
> We are now wondering whether there are currently any development activities concerning further significant performance enhancements, or whether further significant performance enhancements are already planned for the near future.
>
> I would now be loath to start benchmarking with 0.56.1 and then, a month or so later, detect that there have been significant performance enhancements in CEPH in the meantime.
>
> Regards,
> Jutta.

Hi Jutta,

As Wido mentioned, there have been some performance improvements, especially with small IO sizes. The conclusion section of the performance preview may be useful for you:

http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/

One oddity is that there may have been some regression for 128k reads. Overall, though, I'd say that performance has improved, especially on XFS. I don't think it's likely we will be pushing any performance patches to the bobtail series, but it's possible performance could change as a result of a bug fix.

For what it's worth, I've started performing sweeps over ceph parameter spaces (and looking at underlying IO schedulers) to see how tuning affects ceph performance under different scenarios. I'm hoping to be able to release the results later this month.
Mark
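The parameter sweeps Mark describes ultimately reduce to comparing `rados bench` runs against each other. As a small illustration (nothing from the thread — the helper name and the captured-output strings below are hypothetical, though the "Bandwidth (MB/sec):" summary line is what rados bench prints), one could extract and compare the summary bandwidth like this:

```python
import re

def bench_bandwidth(output: str) -> float:
    """Extract the summary bandwidth (MB/s) from captured `rados bench` output.

    Assumes a summary line of the form 'Bandwidth (MB/sec):    95.352'.
    """
    m = re.search(r"Bandwidth \(MB/sec\):\s+([0-9.]+)", output)
    if m is None:
        raise ValueError("no bandwidth summary line found")
    return float(m.group(1))

# Hypothetical captured output from two tuning runs:
run_a = "Total time run: 60.1\nBandwidth (MB/sec):    95.352\n"
run_b = "Total time run: 60.3\nBandwidth (MB/sec):    112.004\n"

# Relative speedup of run B over run A:
print(bench_bandwidth(run_b) / bench_bandwidth(run_a))
```

Feeding each run's stdout through a helper like this makes it easy to sweep one parameter at a time and tabulate the results.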
Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?
Performance work is always ongoing, but I am not aware of any significant imminent enhancements. We are just wrapping up an investigation of the effects of various file system and I/O options on different types of traffic, and the next major area of focus will be RADOS Block Device and VMs over RBD. This is pretty far away from Hadoop and probably won't yield much fruit until March.

There are a few people working on Hadoop integration, and I have not been closely following their activities, but I do not believe that any major performance work will be forthcoming in the next few weeks.

On 01/09/2013 04:51 AM, Lachfeld, Jutta wrote:
> Hi all,
>
> in expectation of better performance, we are just switching from CEPH version 0.48 to 0.56.1 for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.
>
> We are now wondering whether there are currently any development activities concerning further significant performance enhancements, or whether further significant performance enhancements are already planned for the near future.
>
> I would now be loath to start benchmarking with 0.56.1 and then, a month or so later, detect that there have been significant performance enhancements in CEPH in the meantime.
Re: OSD's slow down to a crawl
On 01/09/2013 02:52 AM, Matthew Anderson wrote:
> Hi Sage,
>
> Sorry for the late follow up, I've been on a bit of a testing rampage and managed to somewhat sort the problem.
>
> Most of the problem appears to be from the 3.7.1 kernel. It seems to have a fairly big issue with its networking stack that was causing Ceph's network operations to hang. Moving back to a 3.6.8 kernel fixed this up. I don't know the full extent of the problem, but it was reported on Phoronix briefly here - http://www.phoronix.com/scan.php?page=news_item&px=MTI2Nzc
>
> The second issue was BTRFS on both the 3.7 and 3.6.8 kernels. After running a long rados bench (10 minutes) on a fresh cluster it would often slow down significantly, going from 250MB/s down to a 50MB/s average. Latency also increased dramatically. Restarting the OSD's fixes the issue, but after a while it slows right down again. In the end I re-formatted the cluster using XFS (and also EXT4 for benchmarks) and there wasn't a single issue, even with rados bench running for over 30 minutes from another machine.

Ah, too bad this is still happening. :( It's interesting though that restarting the OSDs fixes it. That's not something I expected. Sounds like I need to run some more tests again and see if I can get to the bottom of it.

> At this stage I need to start moving into production with XFS. My test cluster arrives in a few weeks, so I should be able to come back to the BTRFS issue later on, as it would be very handy to have compression working.
>
> Thanks again for your help
> -Matt

-Original Message-
From: Sage Weil [mailto:s...@inktank.com]
Sent: Saturday, 22 December 2012 12:02 AM
To: Matthew Anderson
Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
Subject: RE: OSD's slow down to a crawl

On Fri, 21 Dec 2012, Matthew Anderson wrote:

Hi Sage,

I've tried to reproduce the error again with logging on every OSD and got the above.
RADOS bench had stalled on a write request like the last time, and the attached log is the grep'd OSD log (# cat osd.25.log | grep client.9501.0:744 > freeze.log). The OSD that stalled was 25; the pg map is below -

# ceph pg map 6.5d83495b
osdmap e3775 pg 6.5d83495b (6.95b) -> up [25,31] acting [25,31]

I hope that's what you were after, if not just let me know.

We're getting closer. The osd tried to send the reply. Can you reproduce with 'debug ms = 20' on the osds too, and on the client side do something like

rados --debug-ms 20 --debug-objecter 20 --log-file /tmp/foo ...

Thanks!
sage

Thanks again
-Matt

-Original Message-
From: Sage Weil [mailto:s...@inktank.com]
Sent: Friday, 21 December 2012 1:14 AM
To: Matthew Anderson
Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
Subject: RE: OSD's slow down to a crawl

On Thu, 20 Dec 2012, Matthew Anderson wrote:

Hi Sage,

Logs are attached. I took the osd logs from osd.24 as this is the first osd in my SSD pool I've been testing with previously.

For the 4MB bench I was able to reproduce the fault by restarting my rbd export, which stalled after a few percent complete. When I ran the 4MB bench it stalled early on and never received a response back from the OSD, and I terminated it after 60 seconds or so. I wasn't able to reproduce the fault using the 4kb io size. The 4kb log should show rados bench completing normally at a respectable speed of about 1MB/s.

Let's drill into the hang.. up until that point things look okay.

2012-12-21 00:51:26.033622 7f6f3c042760 1 -- 172.16.0.13:0/1023886 --> 172.16.0.13:6813/22233 -- osd_op(client.9503.0:185 benchmark_data_KVM04_23886_object184 [write 0~4194304] 6.3ca4346e) v4 -- ?+0 0x171ea50 con 0x171a7e0

Do you have a log for that OSD so we can see what happened there? It may also be that the replicated write is hung. If you do

ceph pg map 6.3ca4346e

you can see all OSDs storing that PG. And/or you can grep for client.9503.0:185 in 172.16.0.13:6813/22233's log and see whether the sub_op was sent.

Thanks!
sage

Thanks
-Matt

-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
Sent: Friday, 21 December 2012 12:30 AM
To: Matthew Anderson
Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
Subject: RE: OSD's slow down to a crawl

Can you do a similar test, but with full logging on?

ceph tell osd.0 injectargs '--debug-ms 1 --debug-filestore 20 --debug-osd 20 --debug-journal 20'
rados -p ssd bench 30 write -b 4096 -t 1 --log-file /tmp/foo --debug-ms 1

That will be a single IO in flight at a time and very easy to trace through the logs. If you can post the resulting log file (/tmp/foo and from osd.0), that would be awesome.

Thanks!
sage

On Thu, 20 Dec 2012, Matthew Anderson wrote:

# rados bench 60 write -t 256 -p ssd
Maintaining 256 concurrent writes of 4194304 bytes for at least 60 seconds.
Object prefix: benchmark_data_KVM03_12985
  sec Cur ops started finished avg MB/s cur MB/s last lat avg
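Sage's drill-down procedure — enable message-level debugging, find the acting OSDs with `ceph pg map`, then grep each OSD's log for the stuck client op — lends itself to a small script. The sketch below is illustrative only: the helper and the sample log line are hypothetical, loosely modeled on the debug-ms output quoted in the thread ('-->' marking sent messages and '<==' marking received ones):

```python
def trace_op(log_text: str, op_id: str) -> dict:
    """Report whether an op like 'client.9503.0:185' was ever received
    by this OSD and whether a reply was sent back out."""
    received = replied = False
    for line in log_text.splitlines():
        if op_id not in line:
            continue
        # debug-ms prefixes received messages with '<==' and sent ones with '-->'
        if "<==" in line and "osd_op(" in line:
            received = True
        if "-->" in line and "osd_op_reply(" in line:
            replied = True
    return {"received": received, "replied": replied}

# Hypothetical one-line excerpt in the style of the log quoted above:
sample = ("2012-12-21 00:51:26.040000 7f6f3c042760 1 -- 172.16.0.13:6813/22233 "
          "<== client.9503 172.16.0.13:0/1023886 -- osd_op(client.9503.0:185 "
          "benchmark_data_KVM04_23886_object184 [write 0~4194304] 6.3ca4346e) v4")

print(trace_op(sample, "client.9503.0:185"))
```

An op that shows received but not replied points at the OSD (or its replica sub_ops) rather than at the client or the network, which is exactly the distinction Sage is trying to make here.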
Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?
On 01/09/2013 01:51 PM, Lachfeld, Jutta wrote:
> Hi all,
>
> in expectation of better performance, we are just switching from CEPH version 0.48 to 0.56.1 for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.
>
> We are now wondering whether there are currently any development activities concerning further significant performance enhancements, or whether further significant performance enhancements are already planned for the near future.
>
> I would now be loath to start benchmarking with 0.56.1 and then, a month or so later, detect that there have been significant performance enhancements in CEPH in the meantime.

There shouldn't be any major changes, since v0.56.x is a stable release and as such should only receive bug and security fixes and non-risky improvements. Any change that would result in a significant change in performance would probably be too disruptive for a stable release series.

Regards,
Dennis
Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?
Hi,

> Yes, 0.56(.1) has a significant performance increase compared to 0.48

That is not exactly the OP's question, though. If I understand correctly, she is concerned about ongoing performance improvements within the "bobtail" branch, i.e. between 0.56.1 and 0.56.X (with X > 1).

Jutta, what kind of use case do you have in mind, i.e. how complex are your benchmarking scenarios?

Regards,
--ck
Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?
On 01/09/2013 01:51 PM, Lachfeld, Jutta wrote:
> Hi all,
>
> in expectation of better performance, we are just switching from CEPH version 0.48 to 0.56.1 for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.
>
> We are now wondering whether there are currently any development activities concerning further significant performance enhancements, or whether further significant performance enhancements are already planned for the near future.

Yes, 0.56(.1) has a significant performance increase compared to 0.48.

Two blog posts which might be interesting to read:
* http://ceph.com/dev-notes/whats-new-in-the-land-of-osd/
* http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/

I'm not running with HDFS, but I see a good performance increase with virtual machines running on RBD.

Wido

> I would now be loath to start benchmarking with 0.56.1 and then, a month or so later, detect that there have been significant performance enhancements in CEPH in the meantime.
>
> Regards,
> Jutta.
Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?
Hi all,

in expectation of better performance, we are just switching from CEPH version 0.48 to 0.56.1 for comparisons between Hadoop with HDFS and Hadoop with CEPH FS.

We are now wondering whether there are currently any development activities concerning further significant performance enhancements, or whether further significant performance enhancements are already planned for the near future.

I would now be loath to start benchmarking with 0.56.1 and then, a month or so later, detect that there have been significant performance enhancements in CEPH in the meantime.

Regards,
Jutta.
--
jutta.lachf...@ts.fujitsu.com, Fujitsu Technology Solutions, PBG PDG ES&S SWE SOL 4, "Infrastructure Solutions", MchD 5B, Tel. +49-89-3222-2705, Company Details: http://de.ts.fujitsu.com/imprint
Re: Crushmap Design Question
Hi,

On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote:
> Hi,
> Setting rep size to 3 only makes the data triple-replicated; that means
> when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.
> But the monitors are another story: a monitor cluster with 2N+1 nodes
> requires at least N+1 nodes alive, and indeed this is why your Ceph failed.
> It looks to me like this discipline makes it hard to design a deployment
> which is robust against a DC outage. But hoping for input from the
> community: how to make the monitor cluster reliable?

From what I understand he didn't kill the second mon, still leaving 2 out of 3 mons running.

Could you check if your PGs are actually mapped to OSDs spread out over the 3 DCs? "ceph pg dump" should tell you to which OSDs the PGs are mapped.

I've never tried it before, but you don't have equal weights for the datacenters, and I don't know how that affects the situation.

Wido

> Xiaoxi
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Moore, Shawn M
> Sent: 9 January 2013 4:21
> To: ceph-devel@vger.kernel.org
> Subject: Crushmap Design Question
>
> I have been testing ceph for a little over a month now. Our design goal is to have 3 datacenters in different buildings all tied together over 10GbE. Currently there are 10 servers each serving 1 osd in 2 of the datacenters. In the third is one large server with 16 SAS disks serving 8 osds. Eventually we will add one more identical large server into the third datacenter. I have told ceph to keep 3 copies and tried to do the crushmap in such a way that as long as a majority of mon's can stay up, we could run off of one datacenter's worth of osds. So in my testing, it doesn't work out quite this way...
>
> Everything is currently ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>
> I will put hopefully relevant files at the end of this email.
> When all 28 osds are up, I get:
> 2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
>
> When I fail a datacenter (including 1 of 3 mon's) I eventually get:
> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 16362/49086 degraded (33.333%)
>
> At this point everything is still ok. But when I fail the 2nd datacenter (still leaving 2 out of 3 mons running) I get:
> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
>
> Most VM's quit working and "rbd ls" works, but not a single line from "rados -p rbd ls" works and the command hangs. Now after a while (you can see from the timestamps) I end up at, and it stays this way:
> 2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; 7696/49086 degraded (15.679%)
>
> I'm hoping I've done something wrong, so please advise. Below are my configs. If you need something more to help, just ask.
>
> Normal output with all datacenters up.
> # ceph osd tree
> # id  weight  type name           up/down  reweight
> -1    80      root default
> -3    36        datacenter hok
> -2    1           host blade151
> 0     1             osd.0         up       1
> -4    1           host blade152
> 1     1             osd.1         up       1
> -15   1           host blade153
> 2     1             osd.2         up       1
> -17   1           host blade154
> 3     1             osd.3         up       1
> -18   1           host blade155
> 4     1             osd.4         up       1
> -19   1           host blade159
> 5     1             osd.5         up       1
> -20   1           host blade160
> 6     1             osd.6         up       1
> -21   1           host blade161
> 7     1             osd.7         up       1
> -22   1           host blade162
> 8     1             osd.8         up       1
> -23   1           host blade163
> 9     1             osd.9         up       1
> -24   36        datacenter csc
> -5    1           host admbc0-01
> 10    1             osd.10        up       1
> -6    1
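For the placement half of Shawn's goal — each PG keeping one replica in every datacenter — a CRUSH rule along the following lines would force that spread. This is a sketch, not his actual map: the rule name and ruleset number are invented, and it assumes the `datacenter` buckets shown in his tree:

```
rule rep_per_datacenter {
	ruleset 3
	type replicated
	min_size 3
	max_size 3
	step take default
	step chooseleaf firstn 0 type datacenter
	step emit
}
```

With pool size 3, `chooseleaf firstn 0 type datacenter` selects three distinct datacenters and one OSD under each. Note that placement alone may not be enough for single-DC survival: the pool's min_size must also permit serving I/O from a single surviving replica (e.g. `ceph osd pool set rbd min_size 1`), and whether the surviving copies really are spread one per DC is worth verifying with "ceph pg dump", as Wido suggests above.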
RE: OSD's slow down to a crawl
Hi Sage,

Sorry for the late follow up, I've been on a bit of a testing rampage and managed to somewhat sort the problem.

Most of the problem appears to be from the 3.7.1 kernel. It seems to have a fairly big issue with its networking stack that was causing Ceph's network operations to hang. Moving back to a 3.6.8 kernel fixed this up. I don't know the full extent of the problem, but it was reported on Phoronix briefly here - http://www.phoronix.com/scan.php?page=news_item&px=MTI2Nzc

The second issue was BTRFS on both the 3.7 and 3.6.8 kernels. After running a long rados bench (10 minutes) on a fresh cluster it would often slow down significantly, going from 250MB/s down to a 50MB/s average. Latency also increased dramatically. Restarting the OSD's fixes the issue, but after a while it slows right down again. In the end I re-formatted the cluster using XFS (and also EXT4 for benchmarks) and there wasn't a single issue, even with rados bench running for over 30 minutes from another machine.

At this stage I need to start moving into production with XFS. My test cluster arrives in a few weeks, so I should be able to come back to the BTRFS issue later on, as it would be very handy to have compression working.

Thanks again for your help
-Matt

-Original Message-
From: Sage Weil [mailto:s...@inktank.com]
Sent: Saturday, 22 December 2012 12:02 AM
To: Matthew Anderson
Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
Subject: RE: OSD's slow down to a crawl

On Fri, 21 Dec 2012, Matthew Anderson wrote:
> Hi Sage,
>
> I've tried to reproduce the error again with logging on every OSD and
> got the above. RADOS bench had stalled on a write request like the
> last time and the attached log is the grep'd OSD log (# cat osd.25.log
> | grep client.9501.0:744 > freeze.log).
The OSD that stalled was 25,
> pg map is below -
>
> # ceph pg map 6.5d83495b
> osdmap e3775 pg 6.5d83495b (6.95b) -> up [25,31] acting [25,31]
>
> I hope that's what you were after, if not just let me know

We're getting closer. The osd tried to send the reply. Can you reproduce with 'debug ms = 20' on the osds too, and on the client side do something like

rados --debug-ms 20 --debug-objecter 20 --log-file /tmp/foo ...

Thanks!
sage

>
> Thanks again
> -Matt
>
> -Original Message-
> From: Sage Weil [mailto:s...@inktank.com]
> Sent: Friday, 21 December 2012 1:14 AM
> To: Matthew Anderson
> Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
> Subject: RE: OSD's slow down to a crawl
>
> On Thu, 20 Dec 2012, Matthew Anderson wrote:
> > Hi Sage,
> >
> > Logs are attached. I took the osd logs from osd.24 as this is the
> > first osd in my SSD pool I've been testing with previously.
> >
> > For the 4MB bench I was able to reproduce the fault by restarting my rbd
> > export, which stalled after a few percent complete. When I ran the
> > 4MB bench it stalled early on and never received a response back
> > from the OSD and I terminated it after 60 seconds or so. I wasn't
> > able to reproduce the fault using the 4kb io size. The 4kb log
> > should show rados bench completing normally at a respectable speed of about 1MB/s.
>
> Let's drill into the hang.. up until that point things look okay.
>
> 2012-12-21 00:51:26.033622 7f6f3c042760 1 -- 172.16.0.13:0/1023886
> --> 172.16.0.13:6813/22233 -- osd_op(client.9503.0:185
> benchmark_data_KVM04_23886_object184 [write 0~4194304] 6.3ca4346e) v4
> -- ?+0 0x171ea50 con 0x171a7e0
>
> Do you have a log for that OSD so we can see what happened there? It
> may also be that the replicated write is hung. If you do
>
> ceph pg map 6.3ca4346e
>
> you can see all OSDs storing that PG. And/or you can grep for
> client.9503.0:185 in 172.16.0.13:6813/22233's log and see whether the sub_op
> was sent.
>
> Thanks!
> sage
>
> > Thanks
> > -Matt
> >
> > -Original Message-
> > From: ceph-devel-ow...@vger.kernel.org
> > [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Friday, 21 December 2012 12:30 AM
> > To: Matthew Anderson
> > Cc: 'Mark Nelson'; ceph-devel@vger.kernel.org
> > Subject: RE: OSD's slow down to a crawl
> >
> > Can you do a similar test, but with full logging on?
> >
> > ceph tell osd.0 injectargs '--debug-ms 1 --debug-filestore 20 --debug-osd 20 --debug-journal 20'
> > rados -p ssd bench 30 write -b 4096 -t 1 --log-file /tmp/foo --debug-ms 1
> >
> > That will be a single IO in flight at a time and very easy to trace through
> > the logs. If you can post the resulting log file (/tmp/foo and from
> > osd.0), that would be awesome.
> >
> > Thanks!
> > sage
> >
> > On Thu, 20 Dec 2012, Matthew Anderson wrote:
> > > # rados bench 60 write -t 256 -p ssd
> > > Maintaining 256 concurrent writes of 4194304 bytes for at least 60 seconds.
> > > Object prefix: benchmark_data_KVM03_12985
> > >    sec Cur ops started finished avg MB/s cur MB/s l
Re: Is Ceph recovery able to handle massive crash
Hello,

On 09/01/2013 00:36, Gregory Farnum wrote:
> It looks like it's taking approximately forever for writes to complete to disk; it's shutting down because threads are going off to write and not coming back. If you set "osd op thread timeout = 60" (or 120) it might manage to churn through, but I'd look into why the writes are taking so long — bad disk, fragmented btrfs filesystem, or something else.

I believe it is a BTRFS issue, as when I mkfs.btrfs the volume and rejoin it to the cluster, it works (the OSD stays up).

Denis
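Greg's suggested workaround, expressed as a ceph.conf fragment (a sketch — the value is the upper one he mentions; pick what suits the hardware, and it can also be applied at runtime via injectargs in the style used elsewhere in these threads):

```
[osd]
	; give slow btrfs writes longer before the op thread is declared hung
	osd op thread timeout = 120
```

This only masks the symptom; as Greg says, the underlying slow writes (bad disk, fragmented btrfs) still need investigating.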