Re: Client can't reboot when rbd volume is mounted.
On 11.02.2013 17:52, Sage Weil wrote:
On Mon, 11 Feb 2013, Roman Alekseev wrote:
On 11.02.2013 09:36, Sage Weil wrote:
On Mon, 11 Feb 2013, Roman Alekseev wrote:

Hi, when I try to reboot a client server without unmounting the rbd volume manually, its services stop working but the server doesn't reboot completely and shows the following logs in the KVM console: [235618.0202207] libceph: connect 192.168.0.19:6789 error -101

That is #define ENETUNREACH 101 /* Network is unreachable */. Note that that (or any other) socket error is not necessarily fatal; the kernel client will retry and eventually connect to that or another OSD to complete the IO. Are you observing that the RBD image hangs or something? You can peek at in-flight IO (and other state) with cat /sys/kernel/debug/ceph/*/osdc. unmount/unmap should not be necessary in any case unless there is a bug. We backported a bunch of stuff to 3.6.6, so 3.6.10 ought to be okay. You might try a newer 3.6.x kernel too; I forget if there was a second batch of fixes.. sage

Hi Sage, "#define ENETUNREACH 101 /* Network is unreachable */" -- the reason for this error is that networking stops working after the server reset request is issued. "Are you observing that the RBD image hangs or something?" The RBD works properly; it is just mapped and mounted on the client server: # /dev/rbd1 99G 616M 93G 1% /home/test

I think I'm confused about what you mean by 'server'. Do you mean the host that rbd is mapped on, or the host(s) where the ceph-osd's are running? By 'the RBD works properly' do you mean the client where it is mapped? In which case, what exactly is the problem?

I mean the host that rbd is mapped on. This host doesn't want to restart while the rbd volume is mounted :) In order to get the server restarted we need to umount the rbd volume manually before running the "reboot" command. The "/sys/kernel/debug" folder is empty; how do I get the 'ceph/*/osdc' content to appear in it?

'mount -t debugfs none /sys/kernel/debug' and it will appear (along with other fun stuff)... sage

I've updated the kernel to 3.7.4 but the problem still persists. Thanks.

-- Kind regards, R. Alekseev
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
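[Editorial aside: a minimal Python sketch of the debugfs check Sage describes, which could be run before a reboot to see whether the kernel client still has requests outstanding. It assumes debugfs is already mounted at /sys/kernel/debug and that each non-empty line of the osdc file corresponds to one in-flight request; treat it as a sketch, not a supported tool.]

```python
#!/usr/bin/env python
# Sketch only: list in-flight ceph kernel-client OSD requests via debugfs.
# Assumes debugfs is mounted (mount -t debugfs none /sys/kernel/debug) and
# that each non-empty line in the osdc file is one outstanding request.
import glob
import sys

def pending_osd_requests():
    pending = []
    for path in glob.glob('/sys/kernel/debug/ceph/*/osdc'):
        with open(path) as f:
            lines = [l for l in f.read().splitlines() if l.strip()]
        if lines:
            pending.append((path, lines))
    return pending

if __name__ == '__main__':
    pending = pending_osd_requests()
    if not pending:
        print('no in-flight OSD requests; unmap/unmount and reboot')
        sys.exit(0)
    for path, lines in pending:
        print('%s: %d outstanding request(s)' % (path, len(lines)))
    sys.exit(1)
```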
slow requests, hunting for new mon
Hi, what are likely causes for "slow requests" and "monclient: hunting for new mon" messages? E.g.:

2013-02-12 16:27:07.318943 7f9c0bc16700 0 monclient: hunting for new mon
...
2013-02-12 16:27:45.892314 7f9c13c26700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 30.383883 secs
2013-02-12 16:27:45.892323 7f9c13c26700 0 log [WRN] : slow request 30.383883 seconds old, received at 2013-02-12 16:27:15.508374: osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.0120 [write 921600~4096] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892328 7f9c13c26700 0 log [WRN] : slow request 30.383782 seconds old, received at 2013-02-12 16:27:15.508475: osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.0120 [write 987136~4096] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892334 7f9c13c26700 0 log [WRN] : slow request 30.383720 seconds old, received at 2013-02-12 16:27:15.508537: osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.0120 [write 1036288~8192] 2.981cf6bc) v4 currently no flag points reached
2013-02-12 16:27:45.892338 7f9c13c26700 0 log [WRN] : slow request 30.383684 seconds old, received at 2013-02-12 16:27:15.508573: osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.0122 [write 1454080~4096] 2.fff29a9a) v4 currently no flag points reached
2013-02-12 16:27:45.892341 7f9c13c26700 0 log [WRN] : slow request 30.328986 seconds old, received at 2013-02-12 16:27:15.563271: osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.0122 [write 1482752~4096] 2.fff29a9a) v4 currently no flag points reached

I have a ceph 0.56.2 system with 3 boxes: two boxes have both a mon and a single osd, and the 3rd box has just a mon - see ceph.conf below. The boxes are running an eclectic mix of self-compiled kernels: b2 is linux-3.4.6, b4 is linux-3.7.3 and b5 is linux-3.6.10.

On b5 / osd.1 the 'hunting' message appears in the osd log regularly, e.g. 190 times yesterday. The message doesn't appear at all on b4 / osd.0.

Both osd logs show the 'slow requests' messages. Generally these come in waves, with 30-50 of the associated individual 'slow request' messages coming in just after the initial 'slow requests' message. Each box saw around 30 waves yesterday, with no obvious time correlation between the two.

The osd disks are generally cruising along at around 400-800 KB/s, with occasional spikes up to 1.2-2 MB/s, with a mostly write load. The gigabit network interfaces (2 per box for public vs cluster) are also cruising, with the busiest interface at less than 5% utilisation. CPU utilisation is likewise small, with 90% or more idle and less than 3% wait for IO. There's plenty of free memory, 19 GB on b4 and 6 GB on b5.

Cheers, Chris

ceph.conf
[global]
auth supported = cephx
[mon]
[mon.b2]
host = b2
mon addr = 10.200.63.130:6789
[mon.b4]
host = b4
mon addr = 10.200.63.132:6789
[mon.b5]
host = b5
mon addr = 10.200.63.133:6789
[osd]
osd journal size = 1000
public network = 10.200.63.0/24
cluster network = 192.168.254.0/24
[osd.0]
host = b4
public addr = 10.200.63.132
cluster addr = 192.168.254.132
[osd.1]
host = b5
public addr = 10.200.63.133
cluster addr = 192.168.254.133

-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
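[Editorial aside: a rough Python sketch for counting the waves Chris describes. The log path and the 60-second quiet period used to separate waves are arbitrary choices for the example, not Ceph defaults.]

```python
#!/usr/bin/env python
# Sketch: group "slow request" warnings in a ceph-osd log into waves.
# The log path and the 60 s gap used to separate waves are assumptions.
from datetime import datetime

LOG = '/var/log/ceph/ceph-osd.1.log'   # adjust to the osd in question
GAP = 60.0                              # seconds of quiet that ends a wave

def parse_ts(line):
    # osd log lines start with e.g. "2013-02-12 16:27:45.892314"
    return datetime.strptime(line[:26], '%Y-%m-%d %H:%M:%S.%f')

waves = []
last = None
with open(LOG) as f:
    for line in f:
        # match only the individual "slow request N seconds old" lines
        if 'slow request ' not in line:
            continue
        ts = parse_ts(line)
        if last is None or (ts - last).total_seconds() > GAP:
            waves.append([])
        waves[-1].append(line.rstrip())
        last = ts

for i, wave in enumerate(waves, 1):
    print('wave %d: %d slow requests, starting %s' % (i, len(wave), wave[0][:19]))
```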
Re: .gitignore issues
On 02/11/2013 06:28 PM, David Zafman wrote: After updating to latest master I have the following files listed by git status: These are mostly renamed binaries. If you run 'make clean' on the version before the name changes (133295ed001a950e3296f4e88a916ab2405be0cc) they'll be removed. If you're sure you have nothing you want to save that's not in a commit, you can always 'git clean -fdx'. src/ceph.conf and src/keyring are generated by vstart.sh, and I forgot to add them to .gitignore again earlier. There was also a typo in ceph-filestore-dump - it was not renamed. These are fixed now. Josh $ git status # On branch master # Untracked files: # (use "git add ..." to include in what will be committed) # # src/bench_log # src/ceph-filestore-dump # src/ceph.conf # src/dupstore # src/keyring # src/kvstorebench # src/multi_stress_watch # src/omapbench # src/psim # src/radosacl # src/scratchtool # src/scratchtoolpp # src/smalliobench # src/smalliobenchdumb # src/smalliobenchfs # src/smalliobenchrbd # src/streamtest # src/testcrypto # src/testkeys # src/testrados # src/testrados_delete_pools_parallel # src/testrados_list_parallel # src/testrados_open_pools_parallel # src/testrados_watch_notify # src/testsignal_handlers # src/testtimers # src/tpbench # src/xattr_bench nothing added to commit but untracked files present (use "git add" to track) David Zafman Senior Developer david.zaf...@inktank.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
.gitignore issues
After updating to latest master I have the following files listed by git status: $ git status # On branch master # Untracked files: # (use "git add ..." to include in what will be committed) # # src/bench_log # src/ceph-filestore-dump # src/ceph.conf # src/dupstore # src/keyring # src/kvstorebench # src/multi_stress_watch # src/omapbench # src/psim # src/radosacl # src/scratchtool # src/scratchtoolpp # src/smalliobench # src/smalliobenchdumb # src/smalliobenchfs # src/smalliobenchrbd # src/streamtest # src/testcrypto # src/testkeys # src/testrados # src/testrados_delete_pools_parallel # src/testrados_list_parallel # src/testrados_open_pools_parallel # src/testrados_watch_notify # src/testsignal_handlers # src/testtimers # src/tpbench # src/xattr_bench nothing added to commit but untracked files present (use "git add" to track) David Zafman Senior Developer david.zaf...@inktank.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
Yes, there were osd daemons running on the same node that the monitor was running on. If that is the case then i will run a test case with the monitor running on a different node where no osd is running and see what happens. Thank you. Isaac From: Gregory Farnum To: Isaac Otsiabah Cc: "ceph-devel@vger.kernel.org" Sent: Monday, February 11, 2013 12:29 PM Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster jIsaac, I'm sorry I haven't been able to wrangle any time to look into this more yet, but Sage pointed out in a related thread that there might be some buggy handling of things like this if the OSD and the monitor are located on the same host. Am I correct in assuming that with your small cluster, all your OSDs are co-located with a monitor daemon? -Greg On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah wrote: > > > Gregory, i recreated the osd down problem again this morning on two nodes > (g13ct, g14ct). First, i created a 1-node cluster on g13ct (with osd.0, 1 ,2) > and then added host g14ct (osd3. 4, 5). osd.1 went down for about 1 minute > and half after adding osd 3, 4, 5 were adde4d. i have included the routing > table of each node at the time osd.1 went down. ceph.conf and ceph-osd.1.log > files are attached. The crush map was default. Also, it could be a timing > issue because it does not always fail when using default crush map, it takes > several trials before you see it. Thank you. > > > [root@g13ct ~]# netstat -r > Kernel IP routing table > Destination Gateway Genmask Flags MSS Window irtt Iface > default 133.164.98.250 0.0.0.0 UG 0 0 0 eth2 > 133.164.98.0 * 255.255.255.0 U 0 0 0 eth2 > link-local * 255.255.0.0 U 0 0 0 eth3 > link-local * 255.255.0.0 U 0 0 0 eth0 > link-local * 255.255.0.0 U 0 0 0 eth2 > 192.0.0.0 * 255.0.0.0 U 0 0 0 eth3 > 192.0.0.0 * 255.0.0.0 U 0 0 0 eth0 > 192.168.0.0 * 255.255.255.0 U 0 0 0 eth3 > 192.168.1.0 * 255.255.255.0 U 0 0 0 eth0 > [root@g13ct ~]# ceph osd tree > > # id weight type name up/down reweight > -1 6 root default > -3 6 rack unknownrack > -2 3 host g13ct > 0 1 osd.0 up 1 > 1 1 osd.1 down 1 > 2 1 osd.2 up 1 > -4 3 host g14ct > 3 1 osd.3 up 1 > 4 1 osd.4 up 1 > 5 1 osd.5 up 1 > > > > [root@g14ct ~]# ceph osd tree > > # id weight type name up/down reweight > -1 6 root default > -3 6 rack unknownrack > -2 3 host g13ct > 0 1 osd.0 up 1 > 1 1 osd.1 down 1 > 2 1 osd.2 up 1 > -4 3 host g14ct > 3 1 osd.3 up 1 > 4 1 osd.4 up 1 > 5 1 osd.5 up 1 > > [root@g14ct ~]# netstat -r > Kernel IP routing table > Destination Gateway Genmask Flags MSS Window irtt Iface > default 133.164.98.250 0.0.0.0 UG 0 0 0 eth0 > 133.164.98.0 * 255.255.255.0 U 0 0 0 eth0 > link-local * 255.255.0.0 U 0 0 0 eth3 > link-local * 255.255.0.0 U 0 0 0 eth5 > link-local * 255.255.0.0 U 0 0 0 eth0 > 192.0.0.0 * 255.0.0.0 U 0 0 0 eth3 > 192.0.0.0 * 255.0.0.0 U 0 0 0 eth5 > 192.168.0.0 * 255.255.255.0 U 0 0 0 eth3 > 192.168.1.0 * 255.255.255.0 U 0 0 0 eth5 > [root@g14ct ~]# ceph osd tree > > # id weight type name up/down reweight > -1 6 root default > -3 6 rack unknownrack > -2 3 host g13ct > 0 1 osd.0 up 1 > 1 1 osd.1 down 1 > 2 1 osd.2 up 1 > -4 3 host g14ct > 3 1 osd.3 up 1 > 4 1 osd.4 up
Re: File exists not handled in 0.48argonaut1
The actual problem appears to be a corrupted log file. You should rename out of the way the directory: /mnt/osd97/current/corrupt_log_2013-02-08_18:50_2.fa8. Then, restart the osd with debug osd = 20, debug filestore = 20, and debug ms = 1 in the [osd] section of the ceph.conf. -Sam On Mon, Feb 11, 2013 at 2:21 PM, Mandell Degerness wrote: > Since the attachment didn't work, apparently, here is a link to the log: > > http://dl.dropbox.com/u/766198/error17.log.gz > > On Mon, Feb 11, 2013 at 1:42 PM, Samuel Just wrote: >> I don't see the more complete log. >> -Sam >> >> On Mon, Feb 11, 2013 at 11:12 AM, Mandell Degerness >> wrote: >>> Anyone have any thoughts on this??? It looks like I may have to wipe >>> out the OSDs effected and rebuild them, but I'm afraid that may result >>> in data loss because of the old OSD first crush map in place :(. >>> >>> On Fri, Feb 8, 2013 at 1:36 PM, Mandell Degerness >>> wrote: We ran into an error which appears very much like a bug fixed in 0.44. This cluster is running version: ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) The error line is: Feb 8 18:50:07 192.168.8.14 ceph-osd: 2013-02-08 18:50:07.545682 7f40f9f08700 0 filestore(/mnt/osd97) error (17) File exists not handled on operation 20 (11279344.0.0, or op 0, counting from 0) A more complete log is attached. First question: is this a know bug fixed in more recent versions? Second question: is there any hope of recovery? >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rest mgmt api
On Mon, 11 Feb 2013, Gregory Farnum wrote: > [...] > ...but my instinct is to want one canonical code path in the monitors, > not two. Two allows for discrepancies in what each method allows to > [...] Yeah, I'm convinced. Just chatted with Dan and Josh a bit about this. Josh had the interesting idea that the specification of what commands are supported could be requested from the monitor is some canonical form (say, a blob of JSON), and then enforced at the client. That would be translated into an argparse config for the CLI, and a simple matching/validation table for the REST endpoint. That might be worth the complexity to get the best of both worlds... but first Dan is looking at whether Python's argparse will do everything we want for the CLI end of things. In the meantime, the first set of tasks still stand: move the ceph tool cruft into MonClient and Objecter and out of tool/common.cc for starters. sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
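[Editorial aside: to make Josh's idea a bit more concrete, a rough Python sketch; the spec format, field names, and the two example commands are invented for illustration and are not an agreed-on Ceph schema.]

```python
#!/usr/bin/env python
# Sketch of "command spec published by the monitor, enforced at the client".
# The spec format and the example commands below are invented, not a real
# Ceph schema.
import argparse
import json

SPEC_FROM_MON = json.loads('''
[
  {"prefix": ["osd", "down"],
   "args": [{"name": "id", "type": "int", "help": "osd id"}]},
  {"prefix": ["osd", "pool", "create"],
   "args": [{"name": "pool", "type": "str", "help": "pool name"},
            {"name": "pg_num", "type": "int", "help": "number of PGs"}]}
]
''')

TYPES = {'int': int, 'str': str}

def parse_command(spec, argv):
    """Match argv against the spec and return a canonical JSON-able dict."""
    for cmd in spec:
        n = len(cmd['prefix'])
        if argv[:n] != cmd['prefix']:
            continue
        p = argparse.ArgumentParser(prog=' '.join(cmd['prefix']))
        for arg in cmd['args']:
            p.add_argument(arg['name'], type=TYPES[arg['type']], help=arg['help'])
        args = vars(p.parse_args(argv[n:]))
        return {'prefix': '/'.join(cmd['prefix']), 'args': args}
    raise SystemExit('unknown command: %r' % (argv,))

if __name__ == '__main__':
    print(json.dumps(parse_command(SPEC_FROM_MON, ['osd', 'down', '123'])))
    # -> {"prefix": "osd/down", "args": {"id": 123}}
```

The same spec could, in principle, drive the matching/validation table on the REST endpoint, which is the "best of both worlds" outcome described above.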
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 02:47:13PM -0800, Gregory Farnum wrote: > On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf wrote: > > On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: > > Yes, there is a dump of 100,000 events for this backtrace in the linked > > archive (I need 7 hours to upload it). > > Can you just pastebin the last couple hundred lines? I'm mostly > interested if there's anything from the function which actually caused > the assert/segfault. Also, the log should compress well and get much > smaller! Sent in pm. And yes, I have a good compression rate but... % ls -lh total 38G -rw-r--r-- 1 kdecherf kdecherf 3.3G Feb 11 18:36 cc-ceph-log.tar.gz -rw--- 1 kdecherf kdecherf 66M Feb 4 17:57 ceph.log -rw-r--r-- 1 kdecherf kdecherf 3.5G Feb 4 14:44 ceph-mds.b.log -rw-r--r-- 1 kdecherf kdecherf 31G Feb 5 15:55 ceph-mds.c.log -rw-r--r-- 1 kdecherf kdecherf 27M Feb 11 19:46 ceph-osd.14.log ;-) > > The distribution is heterogeneous: we have a folder of ~17G for 300k > > objects, another of ~2G for 150k objects and a lof of smaller directories. > > Sorry, you mean 300,000 files in the single folder? > If so, that's definitely why it's behaving so badly — your folder is > larger than your maximum cache size settings, and so if you run an > "ls" or anything the MDS will read the whole thing off disk, then > instantly drop most of the folder from its cache. Then re-read again > for the next request to list contents, etc etc. The biggest top-level folder contains 300k files but splitted into several subfolders (a subfolder does not contain more than 10,000 files at its level). > > Are you talking about the mds bal frag and mds bal split * settings? > > Do you have any advice about the value to use? > If you set "mds bal frag = true" in your config, it will split up > those very large directories into smaller fragments and behave a lot > better. This isn't quite as stable (thus the default to "off"), so if > you have the memory to just really up your cache size I'd start with > that and see if it makes your problems better. But if it doesn't, > directory fragmentation does work reasonably well and it's something > we'd be interested in bug reports for. :) I will try it, thanks! -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unable to mount cephfs - can't read superblock
On Feb 9, 2013, at 3:25 AM, Adam Nielsen wrote: > I will use that list as soon as it appears on GMane, since I find their NNTP > interface a lot easier than managing a bunch of mailing list subscriptions! > Maybe someone with more authority than myself can add it? > > http://gmane.org/subscribe.php Agree - requested it last week, will follow up when it's added. Cheers, Ross -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf wrote: > On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: >> On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf wrote: >> > References: >> > [1] http://www.spinics.net/lists/ceph-devel/msg04903.html >> > [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) >> > 1: /usr/bin/ceph-mds() [0x817e82] >> > 2: (()+0xf140) [0x7f9091d30140] >> > 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1] >> > 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9] >> > 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70] >> > 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90] >> > 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2] >> > 8: (Server::kill_session(Session*)+0x137) [0x549c67] >> > 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6] >> > 10: (MDS::tick()+0x338) [0x4da928] >> > 11: (SafeTimer::timer_thread()+0x1af) [0x78151f] >> > 12: (SafeTimerThread::entry()+0xd) [0x782bad] >> > 13: (()+0x7ddf) [0x7f9091d28ddf] >> > 14: (clone()+0x6d) [0x7f90909cc24d] >> >> This in particular is quite odd. Do you have any logging from when >> that happened? (Oftentimes the log can have a bunch of debugging >> information from shortly before the crash.) > > Yes, there is a dump of 100,000 events for this backtrace in the linked > archive (I need 7 hours to upload it). Can you just pastebin the last couple hundred lines? I'm mostly interested if there's anything from the function which actually caused the assert/segfault. Also, the log should compress well and get much smaller! >> On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf wrote: >> > Furthermore, I observe another strange thing more or less related to the >> > storms. >> > >> > During a rsync command to write ~20G of data on Ceph and during (and >> > after) the storm, one OSD sends a lot of data to the active MDS >> > (400Mbps peak each 6 seconds). After a quick check, I found that when I >> > stop osd.23, osd.14 stops its peaks. >> >> This is consistent with Sam's suggestion that MDS is thrashing its >> cache, and is grabbing a directory object off of the OSDs. How large >> are the directories you're using? If they're a significant fraction of >> your cache size, it might be worth enabling the (sadly less stable) >> directory fragmentation options, which will split them up into smaller >> fragments that can be independently read and written to disk. > > The distribution is heterogeneous: we have a folder of ~17G for 300k > objects, another of ~2G for 150k objects and a lof of smaller directories. Sorry, you mean 300,000 files in the single folder? If so, that's definitely why it's behaving so badly — your folder is larger than your maximum cache size settings, and so if you run an "ls" or anything the MDS will read the whole thing off disk, then instantly drop most of the folder from its cache. Then re-read again for the next request to list contents, etc etc. > Are you talking about the mds bal frag and mds bal split * settings? > Do you have any advice about the value to use? If you set "mds bal frag = true" in your config, it will split up those very large directories into smaller fragments and behave a lot better. This isn't quite as stable (thus the default to "off"), so if you have the memory to just really up your cache size I'd start with that and see if it makes your problems better. But if it doesn't, directory fragmentation does work reasonably well and it's something we'd be interested in bug reports for. 
:) -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
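[Editorial aside: a rough way to judge whether a tree is in the territory Greg describes. The 100,000-inode default for "mds cache size" is quoted from memory here, so check it against your ceph version before relying on it.]

```python
#!/usr/bin/env python
# Rough check: compare the dentry count of a tree against the MDS cache size.
# The 100000 default for "mds cache size" is quoted from memory; verify it
# against your ceph version before relying on it.
import os
import sys

MDS_CACHE_SIZE = 100000   # inodes; raised via "mds cache size" in [mds]

def count_entries(root):
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        total += len(dirnames) + len(filenames)
    return total

if __name__ == '__main__':
    root = sys.argv[1] if len(sys.argv) > 1 else '.'
    n = count_entries(root)
    print('%d entries under %s (mds cache size %d)' % (n, root, MDS_CACHE_SIZE))
    if n > MDS_CACHE_SIZE:
        print('larger than the cache: expect re-reads; consider raising the '
              'cache size or enabling "mds bal frag = true"')
```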
Re: rest mgmt api
On Mon, Feb 11, 2013 at 2:00 PM, Sage Weil wrote: > On Mon, 11 Feb 2013, Gregory Farnum wrote: >> On Wed, Feb 6, 2013 at 12:14 PM, Sage Weil wrote: >> > On Wed, 6 Feb 2013, Dimitri Maziuk wrote: >> >> On 02/06/2013 01:34 PM, Sage Weil wrote: >> >> >> >> > I think the one caveat here is that having a single registry for >> >> > commands >> >> > in the monitor means that commands can come in two flavors: >> >> > vector >> >> > (cli) and URL (presumably in json form). But a single command >> >> > dispatch/registry framework will make that distinction pretty simple... >> >> >> >> Any reason you can't have your CLI json-encode the commands (or, >> >> conversely, your cgi/wsgi/php/servlet URL handler decode them into >> >> vector) before passing them on to the monitor? >> > >> > We can, but they won't necessarily look the same, because it is unlikely >> > we can make a sane 1:1 translation of the CLI to REST that makes sense, >> > and it would be nice to avoid baking knowledge about the individual >> > commands into the client side. >> >> I disagree and am with Joao on this one ? the monitor parsing is >> ridiculous as it stand right now, and we should be trying to get rid >> of the manual string parsing. The monitors should be parsing JSON >> commands that are sent by the client; it makes validation and the > > No argument that the current parsing code is bad... > >> logic control flow a lot easier. We're going to want some level of >> intelligence in the clients so that they can tailor themselves to the >> appropriate UI conventions, and having two different parsing paths in > > What do you mean by tailor to UI conventions? Implementing and/or allowing positional versus named parameters, to toss off one suggestion. Obviously the CLI will want to allow input data in a format different than an API, but a port to a different platform might prefer named parameters instead of positional ones, or whatever. Basically I'm agreeing that we as users want to be able to input data differently and have it mean the same thing ;) >> the monitors is just asking for trouble: they will get out of sync and >> have different kinds of parsing errors. >> >> What we could do is have the monitors speak JSON only, and then give >> the clients a minimal intelligence so that the CLI could (for >> instance) prettify the options for commands it knows about, but still >> allow pass-through for access to newer commands it hasn't yet heard >> of. > > That doesn't really help; it means the mon still has to understand the > CLI grammar. > > What we are talking about is the difference between: > > [ 'osd', 'down', '123' ] > > and > > { > URI: '/osd/down', > OSD-Id: 123 > } > > or however we generically translate the HTTP request into JSON. Once we > normalize the code, calling it "parsing" is probably misleading. The top > (CLI) fragment will match against a rule like: > > [ STR("osd"), STR("down"), POSINT ] > > or however we encode the syntax, while the below would match against > > { .prefix = "/osd/down", >.fields = [ "OSD-Id": POSINT ] > } > > ..or something. I'm making this syntax up, but you get the idea: there > would be a strict format for the request and generic code that validates > it and passes the resulting arguments/matches into a function like > > int do_command_osd_down(int n); > > regardless of which type of input pattern it matched. ...but my instinct is to want one canonical code path in the monitors, not two. 
Two allows for discrepancies in what each method allows to come in that we're not going to have if they all come in to the monitor in a single form. So I say that the canonicalization should happen client-side, and the enforcement should happen server-side (and probably client-side as well, but that's just for courtesy). You've suggested that we want the monitors to do the parsing so that old clients will work, but given that new commands in the monitors often require new capabilities in the clients, having it be slightly more awkward to send new commands to new monitors from old clients doesn't seem like such a big deal to me — if somebody's running monitor version .64 and client ceph tool version .60 and wants to use a new thing, I don't feel bad about making them give the CLI a command which completely specifies what the JSON looks like, instead of using the pretty wrapping they'd get if they upgraded their client. Having a canonicalized format also means that when we return errors they can be a lot more useful, since the monitor can specify what fields it received and which ones were bad, instead of just outputting a string from whichever line of code actually broke. Consider an incoming command whose canonical form is [ 'crush', 'add', '123', '1.0' ] And the parsing code runs through that and it fails and the string going back says "error: does not specify weight!". But the user looks and says "yes I did, it's 1.0!" Versus if the error came back as "Received command: ['area': 'crush'
Re: rest mgmt api
On 02/11/2013 04:00 PM, Sage Weil wrote: > On Mon, 11 Feb 2013, Gregory Farnum wrote: ... > That doesn't really help; it means the mon still has to understand the > CLI grammar. > > What we are talking about is the difference between: > > [ 'osd', 'down', '123' ] > > and > > { > URI: '/osd/down', > OSD-Id: 123 > } > > or however we generically translate the HTTP request into JSON. I think the setup we have in mind is where the MON reads something like {"who:"osd", "which":"123", "what":"down", "when":"now"} from a socket (pipe, whatever), the CLI reads "osd down 123 now" from the prompt and pushes {"who:"osd", "which":"123", "what":"down", "when":"now"} into that socket, the webapp gets whatever: "/osd/down/123/now" or ?who=osd&command=down&id=123&when=now" from whoever impersonates the browser and pipes {"who:"osd", "which":"123", "what":"down", "when":"now"} into that same socket, and all three of them are three completely separate applications that don't try to do what they don't need to. > FWIW you could pass the CLI command as JSON, but that's no different than > encoding vector; it's still a different way to describing the same > command. The devil is of course in the details: in (e.g.) python json.loads() the string and gives you the map you could plug into a lookup table or something to get right to the function call. My c++ is way rusty, I've no idea what's available in boost &co -- if you have to roll your own json parser then you indeed don't care how that vector is encoded. -- Dimitri Maziuk Programmer/sysadmin BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu signature.asc Description: OpenPGP digital signature
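[Editorial aside: a toy Python version of the split Dimitri sketches; the field names are taken from his example, and the whole thing is illustrative rather than any real Ceph interface.]

```python
#!/usr/bin/env python
# Toy front-ends producing one canonical JSON command, per Dimitri's example.
# Field names and the overall scheme are illustrative only.
import json

def from_cli(line):
    # "osd down 123 now" -> canonical dict
    who, what, which, when = line.split()
    return {'who': who, 'what': what, 'which': which, 'when': when}

def from_url(path):
    # "/osd/down/123/now" -> the same canonical dict
    who, what, which, when = path.strip('/').split('/')
    return {'who': who, 'what': what, 'which': which, 'when': when}

if __name__ == '__main__':
    a = from_cli('osd down 123 now')
    b = from_url('/osd/down/123/now')
    assert a == b
    print(json.dumps(a))   # this is what would be written to the mon socket
```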
Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type
2013/2/12, Dave Chinner : > On Mon, Feb 11, 2013 at 05:25:58PM +0900, Namjae Jeon wrote: >> From: Namjae Jeon >> >> This patch is a follow up on below patch: >> >> [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type >> commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 > >> diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c >> index a836118..3391800 100644 >> --- a/fs/xfs/xfs_export.c >> +++ b/fs/xfs/xfs_export.c >> @@ -48,7 +48,7 @@ static int xfs_fileid_length(int fileid_type) >> case FILEID_INO32_GEN_PARENT | XFS_FILEID_TYPE_64FLAG: >> return 6; >> } >> -return 255; /* invalid */ >> +return FILEID_INVALID; /* invalid */ >> } > > I think you can drop the "/* invalid */" comment from there now as > it is redundant with this change. Okay, Thanks for review :-) > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: > On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf wrote: > > References: > > [1] http://www.spinics.net/lists/ceph-devel/msg04903.html > > [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) > > 1: /usr/bin/ceph-mds() [0x817e82] > > 2: (()+0xf140) [0x7f9091d30140] > > 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1] > > 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9] > > 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70] > > 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90] > > 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2] > > 8: (Server::kill_session(Session*)+0x137) [0x549c67] > > 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6] > > 10: (MDS::tick()+0x338) [0x4da928] > > 11: (SafeTimer::timer_thread()+0x1af) [0x78151f] > > 12: (SafeTimerThread::entry()+0xd) [0x782bad] > > 13: (()+0x7ddf) [0x7f9091d28ddf] > > 14: (clone()+0x6d) [0x7f90909cc24d] > > This in particular is quite odd. Do you have any logging from when > that happened? (Oftentimes the log can have a bunch of debugging > information from shortly before the crash.) Yes, there is a dump of 100,000 events for this backtrace in the linked archive (I need 7 hours to upload it). > > On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf wrote: > > Furthermore, I observe another strange thing more or less related to the > > storms. > > > > During a rsync command to write ~20G of data on Ceph and during (and > > after) the storm, one OSD sends a lot of data to the active MDS > > (400Mbps peak each 6 seconds). After a quick check, I found that when I > > stop osd.23, osd.14 stops its peaks. > > This is consistent with Sam's suggestion that MDS is thrashing its > cache, and is grabbing a directory object off of the OSDs. How large > are the directories you're using? If they're a significant fraction of > your cache size, it might be worth enabling the (sadly less stable) > directory fragmentation options, which will split them up into smaller > fragments that can be independently read and written to disk. The distribution is heterogeneous: we have a folder of ~17G for 300k objects, another of ~2G for 150k objects and a lof of smaller directories. Are you talking about the mds bal frag and mds bal split * settings? Do you have any advice about the value to use? -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: File exists not handled in 0.48argonaut1
Since the attachment didn't work, apparently, here is a link to the log: http://dl.dropbox.com/u/766198/error17.log.gz On Mon, Feb 11, 2013 at 1:42 PM, Samuel Just wrote: > I don't see the more complete log. > -Sam > > On Mon, Feb 11, 2013 at 11:12 AM, Mandell Degerness > wrote: >> Anyone have any thoughts on this??? It looks like I may have to wipe >> out the OSDs effected and rebuild them, but I'm afraid that may result >> in data loss because of the old OSD first crush map in place :(. >> >> On Fri, Feb 8, 2013 at 1:36 PM, Mandell Degerness >> wrote: >>> We ran into an error which appears very much like a bug fixed in 0.44. >>> >>> This cluster is running version: >>> >>> ceph version 0.48.1argonaut >>> (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) >>> >>> The error line is: >>> >>> Feb 8 18:50:07 192.168.8.14 ceph-osd: 2013-02-08 18:50:07.545682 >>> 7f40f9f08700 0 filestore(/mnt/osd97) error (17) File exists not >>> handled on operation 20 (11279344.0.0, or op 0, counting from 0) >>> >>> A more complete log is attached. >>> >>> First question: is this a know bug fixed in more recent versions? >>> >>> Second question: is there any hope of recovery? >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: chain_fsetxattr extra chunk removal
Hi, I amended the unit tests ( https://github.com/ceph/ceph/pull/40/files ) to cover the code below. A review would be much appreciated :-) Cheers On 02/11/2013 09:08 PM, Loic Dachary wrote: > > > On 02/11/2013 06:13 AM, Yehuda Sadeh wrote: >> On Thu, Feb 7, 2013 at 12:59 PM, Loic Dachary wrote: >>> Hi, >>> >>> While writing unit tests for chain_xattr.cc I tried to understand how to >>> create the conditions to trigger this part of the chain_fsetxattr function: >>> >>> /* if we're exactly at a chunk size, remove the next one (if wasn't >>> removed >>> before) */ >>> if (ret >= 0 && chunk_size == CHAIN_XATTR_MAX_BLOCK_LEN) { >>> get_raw_xattr_name(name, i, raw_name, sizeof(raw_name)); >>> int r = sys_fremovexattr(fd, raw_name); >>> if (r < 0 && r != -ENODATA) >>> ret = r; >>> } >>> >>> I suspect this cleans up extra empty attributes created as a side effect of >>> a previous version of the function. Or I just don't understand the case it >>> addresses. >>> >>> I'd very much appreciate a hint :-) >>> >> >> Well, the code has changed a bit, but originally when a chain was >> overwritten we didn't bother to remove the xattrs tail. When we read >> the chain we stop either when we got a short xattr, or when the next >> xattr in the chain didn't exist. So when writing an xattr that was >> perfectly aligned with the block len we had to remove the next xattr >> in order make sure that readers will not over-read. I'm not too sure >> whether that still the case, Sam might have a better idea. >> In any case, it might be a good idea to test the case where we have a >> big xattr that spans across multiple blocks (e.g., > 3) and being >> overwritten by a short xattr. Probably also need to test it with >> different combinations of aligned and non-aligned block sizes. > > I understand now and I'll modify the pull request > https://github.com/ceph/ceph/pull/40 accordingly. > > Thanks :-) > >> >> Thanks, >> Yehuda >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature
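[Editorial aside: a toy Python model of the chunking rules Yehuda describes (readers stop at a short or missing chunk, so a write that ends exactly on a block boundary must remove the next chunk). It simulates the logic only and is not the chain_xattr.cc code; the 4-byte block size is just to keep the example short.]

```python
#!/usr/bin/env python
# Toy model of chained xattrs: a value is split into fixed-size chunks
# "name@0", "name@1", ...; readers stop at a short chunk or a missing chunk.
# This is a simulation of the logic only, not the real chain_xattr.cc code,
# and the 4-byte block size is chosen just to keep the example readable.
BLOCK = 4

def write_chain(xattrs, name, value):
    chunks = [value[i:i + BLOCK] for i in range(0, len(value), BLOCK)] or ['']
    for i, chunk in enumerate(chunks):
        xattrs['%s@%d' % (name, i)] = chunk
    # If the last chunk is exactly BLOCK bytes, a reader cannot tell that the
    # chain ended, so the next slot must be removed explicitly.
    if len(chunks[-1]) == BLOCK:
        xattrs.pop('%s@%d' % (name, len(chunks)), None)

def read_chain(xattrs, name):
    out, i = '', 0
    while True:
        chunk = xattrs.get('%s@%d' % (name, i))
        if chunk is None:
            return out
        out += chunk
        if len(chunk) < BLOCK:
            return out
        i += 1

xattrs = {}
write_chain(xattrs, 'user.x', 'abcdefghij')   # 3 chunks: abcd efgh ij
write_chain(xattrs, 'user.x', 'abcdefgh')     # exactly 2 blocks: the stale
                                              # chunk 2 ("ij") must be dropped
assert read_chain(xattrs, 'user.x') == 'abcdefgh'
print(xattrs)
```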
Re: rest mgmt api
On Mon, 11 Feb 2013, Gregory Farnum wrote: > On Wed, Feb 6, 2013 at 12:14 PM, Sage Weil wrote: > > On Wed, 6 Feb 2013, Dimitri Maziuk wrote: > >> On 02/06/2013 01:34 PM, Sage Weil wrote: > >> > >> > I think the one caveat here is that having a single registry for commands > >> > in the monitor means that commands can come in two flavors: > >> > vector > >> > (cli) and URL (presumably in json form). But a single command > >> > dispatch/registry framework will make that distinction pretty simple... > >> > >> Any reason you can't have your CLI json-encode the commands (or, > >> conversely, your cgi/wsgi/php/servlet URL handler decode them into > >> vector) before passing them on to the monitor? > > > > We can, but they won't necessarily look the same, because it is unlikely > > we can make a sane 1:1 translation of the CLI to REST that makes sense, > > and it would be nice to avoid baking knowledge about the individual > > commands into the client side. > > I disagree and am with Joao on this one ? the monitor parsing is > ridiculous as it stand right now, and we should be trying to get rid > of the manual string parsing. The monitors should be parsing JSON > commands that are sent by the client; it makes validation and the No argument that the current parsing code is bad... > logic control flow a lot easier. We're going to want some level of > intelligence in the clients so that they can tailor themselves to the > appropriate UI conventions, and having two different parsing paths in What do you mean by tailor to UI conventions? > the monitors is just asking for trouble: they will get out of sync and > have different kinds of parsing errors. > > What we could do is have the monitors speak JSON only, and then give > the clients a minimal intelligence so that the CLI could (for > instance) prettify the options for commands it knows about, but still > allow pass-through for access to newer commands it hasn't yet heard > of. That doesn't really help; it means the mon still has to understand the CLI grammar. What we are talking about is the difference between: [ 'osd', 'down', '123' ] and { URI: '/osd/down', OSD-Id: 123 } or however we generically translate the HTTP request into JSON. Once we normalize the code, calling it "parsing" is probably misleading. The top (CLI) fragment will match against a rule like: [ STR("osd"), STR("down"), POSINT ] or however we encode the syntax, while the below would match against { .prefix = "/osd/down", .fields = [ "OSD-Id": POSINT ] } ..or something. I'm making this syntax up, but you get the idea: there would be a strict format for the request and generic code that validates it and passes the resulting arguments/matches into a function like int do_command_osd_down(int n); regardless of which type of input pattern it matched. Obviously we'll need 100% testing coverage for both the RESTful and CLI variants, whether we do the above or whether the CLI is translating one into the other via duplicated knowledge of the command set. FWIW you could pass the CLI command as JSON, but that's no different than encoding vector; it's still a different way to describing the same command. If the parsing code is wrapping in a single library that validates typed fields or positional arguments/flags, I don't think this is going to turn into anything remotely like the same wild-west horror that the current code represents. And if we were building this from scratch with no legacy, I'd argue that the same model is still pretty good... 
unless we recast the entire CLI in terms of a generic URI+field model that matches the REST API perfectly. Now.. if that is the route we want to go, that is another choice. We could:

- redesign a fresh CLI with commands like

  ceph /osd/123 mark=down
  ceph /pool/foo create pg_num=123

- make this a programmatic transformation to/from a REST request, like

  /osd/123?command=mark&status=down
  /pool/foo?command=create&pg_num=123

  (or whatever the REST requests are "supposed" to look like)

- hard-code a client-side mapping for legacy commands only
- only add new commands in the new syntax

That means retraining users and only adding new commands in the new model of things. And dreaming up said model...

sage
-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
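[Editorial aside: a small Python sketch of what a single registry matched from either input form could look like. The rule encoding, field names, and handler are made up for illustration and do not reflect how the Ceph monitor is actually structured.]

```python
#!/usr/bin/env python
# Sketch of one command registry matched from either input form.
# Rule encoding, field names and the handler are invented for illustration.

def do_command_osd_down(osd_id):
    return 'marking osd.%d down' % osd_id

COMMANDS = [
    {
        'cli':     ['osd', 'down', ('osd_id', int)],   # CLI word pattern
        'prefix':  '/osd/down',                         # REST-style form
        'fields':  {'osd_id': int},
        'handler': do_command_osd_down,
    },
]

def match_cli(argv):
    for cmd in COMMANDS:
        pat = cmd['cli']
        if len(argv) != len(pat):
            continue
        args, ok = {}, True
        for word, rule in zip(argv, pat):
            if isinstance(rule, str):
                ok = ok and word == rule
            else:
                name, typ = rule
                try:
                    args[name] = typ(word)
                except ValueError:
                    ok = False
        if ok:
            return cmd['handler'](**args)
    raise ValueError('no matching command for %r' % (argv,))

def match_rest(request):
    for cmd in COMMANDS:
        if request.get('prefix') != cmd['prefix']:
            continue
        args = {name: typ(request[name]) for name, typ in cmd['fields'].items()}
        return cmd['handler'](**args)
    raise ValueError('no matching command for %r' % (request,))

print(match_cli(['osd', 'down', '123']))
print(match_rest({'prefix': '/osd/down', 'osd_id': 123}))
```

Both inputs funnel into the same do_command_osd_down(), per Sage's description; whether the matching itself should also collapse into one path is exactly the point under debate in this thread.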
Re: File exists not handled in 0.48argonaut1
I don't see the more complete log. -Sam On Mon, Feb 11, 2013 at 11:12 AM, Mandell Degerness wrote: > Anyone have any thoughts on this??? It looks like I may have to wipe > out the OSDs effected and rebuild them, but I'm afraid that may result > in data loss because of the old OSD first crush map in place :(. > > On Fri, Feb 8, 2013 at 1:36 PM, Mandell Degerness > wrote: >> We ran into an error which appears very much like a bug fixed in 0.44. >> >> This cluster is running version: >> >> ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) >> >> The error line is: >> >> Feb 8 18:50:07 192.168.8.14 ceph-osd: 2013-02-08 18:50:07.545682 >> 7f40f9f08700 0 filestore(/mnt/osd97) error (17) File exists not >> handled on operation 20 (11279344.0.0, or op 0, counting from 0) >> >> A more complete log is attached. >> >> First question: is this a know bug fixed in more recent versions? >> >> Second question: is there any hope of recovery? > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type
On Mon, Feb 11, 2013 at 05:25:58PM +0900, Namjae Jeon wrote: > From: Namjae Jeon > > This patch is a follow up on below patch: > > [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type > commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 > diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c > index a836118..3391800 100644 > --- a/fs/xfs/xfs_export.c > +++ b/fs/xfs/xfs_export.c > @@ -48,7 +48,7 @@ static int xfs_fileid_length(int fileid_type) > case FILEID_INO32_GEN_PARENT | XFS_FILEID_TYPE_64FLAG: > return 6; > } > - return 255; /* invalid */ > + return FILEID_INVALID; /* invalid */ > } I think you can drop the "/* invalid */" comment from there now as it is redundant with this change. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: OSD Weights
On Mon, Feb 11, 2013 at 12:43 PM, Holcombe, Christopher wrote: > Hi Everyone, > > I just wanted to confirm my thoughts on the ceph osd weightings. My > understanding is they are a statistical distribution number. My current > setup has 3TB hard drives and they all have the default weight of 1. I was > thinking that if I mixed in 4TB hard drives in the future it would only put > 3TB of data on them. I thought if I changed the weight to 3 for the 3TB hard > drives and 4 for the 4TB hard drives it would correctly use the larger > storage disks. Is that correct? Yep, looks good. -Greg PS: This is a good question for the new ceph-users list. (http://ceph.com/community/introducing-ceph-users/) :) -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
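[Editorial aside: a trivial Python sanity check of the proportional-weight idea. Weight-equals-capacity-in-TB is only a convention; CRUSH cares about the ratios rather than the absolute numbers.]

```python
#!/usr/bin/env python
# Expected first-order data share when CRUSH weights are proportional to
# capacity. "Weight = capacity in TB" is a convention; only ratios matter.
osds = {'osd.0': 3.0, 'osd.1': 3.0, 'osd.2': 4.0}   # weights in "TB"

total = sum(osds.values())
for name, weight in sorted(osds.items()):
    share = weight / total
    print('%s weight %.1f -> ~%.0f%% of the data' % (name, weight, 100 * share))
```

On the cluster itself the weight change would be made with something like `ceph osd crush reweight osd.N <weight>`; check the exact syntax for your release.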
OSD Weights
Hi Everyone, I just wanted to confirm my thoughts on the ceph osd weightings. My understanding is they are a statistical distribution number. My current setup has 3TB hard drives and they all have the default weight of 1. I was thinking that if I mixed in 4TB hard drives in the future it would only put 3TB of data on them. I thought if I changed the weight to 3 for the 3TB hard drives and 4 for the 4TB hard drives it would correctly use the larger storage disks. Is that correct? Thanks, Chris NOTICE: This e-mail and any attachments is intended only for use by the addressee(s) named herein and may contain legally privileged, proprietary or confidential information. If you are not the intended recipient of this e-mail, you are hereby notified that any dissemination, distribution or copying of this email, and any attachments thereto, is strictly prohibited. If you receive this email in error please immediately notify me via reply email or at (800) 927-9800 and permanently delete the original copy and any copy of any e-mail, and any printout. -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster
jIsaac, I'm sorry I haven't been able to wrangle any time to look into this more yet, but Sage pointed out in a related thread that there might be some buggy handling of things like this if the OSD and the monitor are located on the same host. Am I correct in assuming that with your small cluster, all your OSDs are co-located with a monitor daemon? -Greg On Mon, Jan 28, 2013 at 12:17 PM, Isaac Otsiabah wrote: > > > Gregory, i recreated the osd down problem again this morning on two nodes > (g13ct, g14ct). First, i created a 1-node cluster on g13ct (with osd.0, 1 ,2) > and then added host g14ct (osd3. 4, 5). osd.1 went down for about 1 minute > and half after adding osd 3, 4, 5 were adde4d. i have included the routing > table of each node at the time osd.1 went down. ceph.conf and ceph-osd.1.log > files are attached. The crush map was default. Also, it could be a timing > issue because it does not always fail when using default crush map, it takes > several trials before you see it. Thank you. > > > [root@g13ct ~]# netstat -r > Kernel IP routing table > Destination Gateway Genmask Flags MSS Window irtt Iface > default 133.164.98.250 0.0.0.0 UG0 0 0 eth2 > 133.164.98.0* 255.255.255.0 U 0 0 0 eth2 > link-local * 255.255.0.0 U 0 0 0 eth3 > link-local * 255.255.0.0 U 0 0 0 eth0 > link-local * 255.255.0.0 U 0 0 0 eth2 > 192.0.0.0 * 255.0.0.0 U 0 0 0 eth3 > 192.0.0.0 * 255.0.0.0 U 0 0 0 eth0 > 192.168.0.0 * 255.255.255.0 U 0 0 0 eth3 > 192.168.1.0 * 255.255.255.0 U 0 0 0 eth0 > [root@g13ct ~]# ceph osd tree > > # idweight type name up/down reweight > -1 6 root default > -3 6 rack unknownrack > -2 3 host g13ct > 0 1 osd.0 up 1 > 1 1 osd.1 down1 > 2 1 osd.2 up 1 > -4 3 host g14ct > 3 1 osd.3 up 1 > 4 1 osd.4 up 1 > 5 1 osd.5 up 1 > > > > [root@g14ct ~]# ceph osd tree > > # idweight type name up/down reweight > -1 6 root default > -3 6 rack unknownrack > -2 3 host g13ct > 0 1 osd.0 up 1 > 1 1 osd.1 down1 > 2 1 osd.2 up 1 > -4 3 host g14ct > 3 1 osd.3 up 1 > 4 1 osd.4 up 1 > 5 1 osd.5 up 1 > > [root@g14ct ~]# netstat -r > Kernel IP routing table > Destination Gateway Genmask Flags MSS Window irtt Iface > default 133.164.98.250 0.0.0.0 UG0 0 0 eth0 > 133.164.98.0* 255.255.255.0 U 0 0 0 eth0 > link-local * 255.255.0.0 U 0 0 0 eth3 > link-local * 255.255.0.0 U 0 0 0 eth5 > link-local * 255.255.0.0 U 0 0 0 eth0 > 192.0.0.0 * 255.0.0.0 U 0 0 0 eth3 > 192.0.0.0 * 255.0.0.0 U 0 0 0 eth5 > 192.168.0.0 * 255.255.255.0 U 0 0 0 eth3 > 192.168.1.0 * 255.255.255.0 U 0 0 0 eth5 > [root@g14ct ~]# ceph osd tree > > # idweight type name up/down reweight > -1 6 root default > -3 6 rack unknownrack > -2 3 host g13ct > 0 1 osd.0 up 1 > 1 1 osd.1 down1 > 2 1 osd.2 up 1 > -4 3 host g14ct > 3 1 osd.3 up 1 > 4 1 osd.4 up 1 > 5 1 osd.5 up 1 > > > > > > Isaac > > > > > > > > > > > - Original Message - > From: Isaac Otsiabah > To: Gregory Farnum > Cc: "ceph-devel@vger.kernel.org" > Sent: Friday, January 25, 2013 9:51 AM > Subject: Re: osd down (for 2 about 2 minutes) error after adding a new host > to my cluster > > > > Gregory, the network physical layout is simple, the two networks are > separate. the 192.168.0 and the 192.168.1 are not subnets within a
Re: Crash and strange things on MDS
On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf wrote: > References: > [1] http://www.spinics.net/lists/ceph-devel/msg04903.html > [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) > 1: /usr/bin/ceph-mds() [0x817e82] > 2: (()+0xf140) [0x7f9091d30140] > 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1] > 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9] > 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70] > 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90] > 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2] > 8: (Server::kill_session(Session*)+0x137) [0x549c67] > 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6] > 10: (MDS::tick()+0x338) [0x4da928] > 11: (SafeTimer::timer_thread()+0x1af) [0x78151f] > 12: (SafeTimerThread::entry()+0xd) [0x782bad] > 13: (()+0x7ddf) [0x7f9091d28ddf] > 14: (clone()+0x6d) [0x7f90909cc24d] This in particular is quite odd. Do you have any logging from when that happened? (Oftentimes the log can have a bunch of debugging information from shortly before the crash.) On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf wrote: > Furthermore, I observe another strange thing more or less related to the > storms. > > During a rsync command to write ~20G of data on Ceph and during (and > after) the storm, one OSD sends a lot of data to the active MDS > (400Mbps peak each 6 seconds). After a quick check, I found that when I > stop osd.23, osd.14 stops its peaks. This is consistent with Sam's suggestion that MDS is thrashing its cache, and is grabbing a directory object off of the OSDs. How large are the directories you're using? If they're a significant fraction of your cache size, it might be worth enabling the (sadly less stable) directory fragmentation options, which will split them up into smaller fragments that can be independently read and written to disk. -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unable to mount cephfs - can't read superblock
On Sat, Feb 9, 2013 at 2:13 PM, Adam Nielsen wrote: $ ceph -s health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean monmap e1: 1 mons at {0=192.168.0.6:6789/0}, election epoch 0, quorum 0 0 osdmap e3: 1 osds: 1 up, 1 in pgmap v119: 192 pgs: 192 active+degraded; 0 bytes data, 10204 MB used, 2740 GB / 2750 GB avail mdsmap e1: 0/0/1 up >>> >> In any case, this output indicates that your MDS isn't actually running, >> Adam, or at least isn't connected. Check and see if the process is still >> going? >> You should also have minimal logging by default in /var/lib/ceph/mds*; you >> might find some output there that could be useful. > > The MDS appears to be running: > > $ ps -A | grep ceph > 12903 ?00:00:17 ceph-mon > 12966 ?00:00:10 ceph-mds > 13047 ?00:00:31 ceph-osd > > And I found some logs in /var/log/ceph: > > $ cat /var/log/ceph/ceph-mds.0.log > 2013-02-10 07:57:16.505842 b4aa3b70 0 mds.-1.0 ms_handle_connect on > 192.168.0.6:6789/0 > > So it appears the mds is running. Wireshark shows some traffic going between > hosts when the mount request comes through, but then the responses stop and > the client eventually gives up and the mount fails. > >>> You better add a second OSD or just do a mkcephfs again with a second >>> OSD in the configuration. > > I just tried this and it fixed the unclean pgs issue, but I still can't mount > a cephfs filesystem: > > $ ceph -s >health HEALTH_OK >monmap e1: 1 mons at {0=192.168.0.6:6789/0}, election epoch 0, quorum 0 0 >osdmap e5: 2 osds: 2 up, 2 in > pgmap v107: 384 pgs: 384 active+clean; 0 bytes data, 40423 MB used, 5461 > GB / 5501 GB avail >mdsmap e1: 0/0/1 up > > remote$ mount -t ceph 192.168.0.6:6789:/ /mnt/ceph/ > mount: 192.168.0.6:6789:/: can't read superblock > > Running the mds daemon in debug mode says this: > > ... > 2013-02-10 08:07:03.550977 b2a83b70 10 mds.-1.0 MDS::ms_get_authorizer > type=mon > 2013-02-10 08:07:03.551840 b4a87b70 0 mds.-1.0 ms_handle_connect on > 192.168.0.6:6789/0 > 2013-02-10 08:07:03.555307 b738c710 10 mds.-1.0 beacon_send up:boot seq 1 > (currently up:boot) > 2013-02-10 08:07:03.555629 b738c710 10 mds.-1.0 create_logger > 2013-02-10 08:07:03.564138 b4a87b70 5 mds.-1.0 handle_mds_map epoch 1 from > mon.0 > 2013-02-10 08:07:03.564348 b4a87b70 10 mds.-1.0 my compat > compat={},rocompat={},incompat={1=base v0.20,2=client writeable > ranges,3=default file layouts on dirs,4=dir inode in separate object} > 2013-02-10 08:07:03.564454 b4a87b70 10 mds.-1.0 mdsmap compat > compat={},rocompat={},incompat={1=base v0.20,2=client writeable > ranges,3=default file layouts on dirs,4=dir inode in separate object} > 2013-02-10 08:07:03.564547 b4a87b70 10 mds.-1.-1 map says i am > 192.168.0.6:6800/16077 mds.-1.-1 state down:dne > 2013-02-10 08:07:03.564654 b4a87b70 10 mds.-1.-1 not in map yet > 2013-02-10 08:07:07.67 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 2 > (currently down:dne) > 2013-02-10 08:07:11.555858 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 3 > (currently down:dne) > 2013-02-10 08:07:15.556123 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 4 > (currently down:dne) > 2013-02-10 08:07:19.556411 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 5 > (currently down:dne) > 2013-02-10 08:07:23.556654 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 6 > (currently down:dne) > 2013-02-10 08:07:27.556931 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 7 > (currently down:dne) > 2013-02-10 08:07:31.557189 b2881b70 10 mds.-1.-1 beacon_send up:boot seq 8 > (currently down:dne) > ... How bizarre. 
That indicates the MDS is running and is requesting to become active, but the monitor for some reason isn't letting it in. Can you restart your monitor with logging on as well (--debug_mon 20 on the end of the command line, or "debug mon = 20" in the config) and then try again? The other possibility is that maybe your MDS doesn't have the right access permissions; does "ceph auth list" include an MDS, and does it have any permissions associated? -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: preferred OSD
On Fri, Feb 8, 2013 at 4:45 PM, Sage Weil wrote: > Hi Marcus- > > On Fri, 8 Feb 2013, Marcus Sorensen wrote: >> I know people have been disscussing on and off about providing a >> "preferred OSD" for things like multi-datacenter, or even within a >> datacenter, choosing an OSD that would avoid traversing uplinks. Has >> there been any discussion on how to do this? I seem to remember people >> saying things like 'the crush map doesn't work that way at the >> moment'. Presumably, when a client needs to access an object, it looks >> up where the object should be stored via the crush map, which returns >> all OSDs that could be read from. Exactly. >> I was thinking this morning that you >> could potentially leave the crush map out of it, by setting a location >> for each OSD in the ceph.conf, and an /etc/ceph/location file for the >> client. Then use the absolute value of the difference to determine >> preferred OSD. So, if OSD0 was location=1, and OSD1 was location=3, >> and client 1 was location=2, then it would do the normal thing, but if >> client 1 was location=1.3, then it would prefer OSD0 for reads. >> Perhaps that's overly simplistic and wouldn't scale to meet everyone's >> requirements, but you could do multiple locations and sprinkle clients >> in between them all in various ways. Or perhaps the location is a >> matrix, so you could literally map it out on a grid with a set of >> coordinates. What ideas are being discussed around how to implement >> this? > > We can do something like this for reads today, where we pick a read > replica based on the closest IP or some other metric/mask. We generally > don't enable this because it leads to non-optimal cache behavior, but it > could in principle be enabled via a config option for certain clusters > (and in fact some of that code is already in place). Just to be specific — there are currently flags which will let the client read from local-host if it can figure that out, and those aren't heavily-tested but do work when we turn them on. Other metrics of "close" don't appear yet, though. In general, CRUSH locations seem like a good measure of closeness that the client could rely on, rather than a separate "location" value, but it does restrict the usefulness if you've configured multiple CRUSH root nodes. I think it would need to support a tree of some kind though, rather than just a linear value. -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
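[Editorial aside: to illustrate one way "CRUSH location as a measure of closeness" could be scored, a toy Python sketch. The location paths and the scoring are invented, and, as Greg notes, only the read-from-localhost flags exist today.]

```python
#!/usr/bin/env python
# Toy "closeness" score between a client and replica OSDs based on how much
# of their CRUSH location path they share. Locations and scoring are invented
# for illustration; this is not an existing Ceph feature.
def closeness(a, b):
    # a, b are CRUSH paths from root to host, e.g. ['default', 'dc1', 'rack2', 'host-a']
    score = 0
    for x, y in zip(a, b):
        if x != y:
            break
        score += 1
    return score

client = ['default', 'dc1', 'rack2', 'client-host']
replicas = {
    'osd.0': ['default', 'dc1', 'rack2', 'host-a'],
    'osd.7': ['default', 'dc1', 'rack5', 'host-b'],
    'osd.3': ['default', 'dc2', 'rack1', 'host-c'],
}

for osd in sorted(replicas):
    print('%s shares %d path components with the client' % (osd, closeness(client, replicas[osd])))
best = max(replicas, key=lambda osd: closeness(client, replicas[osd]))
print('prefer reads from %s' % best)
```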
Re: chain_fsetxattr extra chunk removal
On 02/11/2013 06:13 AM, Yehuda Sadeh wrote: > On Thu, Feb 7, 2013 at 12:59 PM, Loic Dachary wrote: >> Hi, >> >> While writing unit tests for chain_xattr.cc I tried to understand how to >> create the conditions to trigger this part of the chain_fsetxattr function: >> >> /* if we're exactly at a chunk size, remove the next one (if wasn't removed >> before) */ >> if (ret >= 0 && chunk_size == CHAIN_XATTR_MAX_BLOCK_LEN) { >> get_raw_xattr_name(name, i, raw_name, sizeof(raw_name)); >> int r = sys_fremovexattr(fd, raw_name); >> if (r < 0 && r != -ENODATA) >> ret = r; >> } >> >> I suspect this cleans up extra empty attributes created as a side effect of >> a previous version of the function. Or I just don't understand the case it >> addresses. >> >> I'd very much appreciate a hint :-) >> > > Well, the code has changed a bit, but originally when a chain was > overwritten we didn't bother to remove the xattrs tail. When we read > the chain we stop either when we got a short xattr, or when the next > xattr in the chain didn't exist. So when writing an xattr that was > perfectly aligned with the block len we had to remove the next xattr > in order make sure that readers will not over-read. I'm not too sure > whether that still the case, Sam might have a better idea. > In any case, it might be a good idea to test the case where we have a > big xattr that spans across multiple blocks (e.g., > 3) and being > overwritten by a short xattr. Probably also need to test it with > different combinations of aligned and non-aligned block sizes. I understand now and I'll modify the pull request https://github.com/ceph/ceph/pull/40 accordingly. Thanks :-) > > Thanks, > Yehuda > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Loïc Dachary, Artisan Logiciel Libre signature.asc Description: OpenPGP digital signature
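For the unit test, the scenario Yehuda describes could be sketched along these lines, in the gtest style of the existing tests. The chain_fsetxattr/chain_fgetxattr signatures, the CHAIN_XATTR_MAX_BLOCK_LEN constant and the header path are assumptions taken from a reading of chain_xattr.h rather than verified, and the scratch-file path is made up:

    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>
    #include <string>
    #include "gtest/gtest.h"
    #include "os/chain_xattr.h"   // assumed location of the chain_* helpers

    TEST(chain_xattr, short_overwrite_of_aligned_chain) {
      const char *path = "/tmp/chain_xattr_test_file";   // made-up scratch file
      int fd = ::open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
      ASSERT_GE(fd, 0);

      const char *name = "user.test";
      // big value spanning more than 3 chunks, exactly aligned to the block len
      std::string big(CHAIN_XATTR_MAX_BLOCK_LEN * 4, 'A');
      // short value that fits in a single chunk
      std::string small(10, 'B');

      ASSERT_GE(chain_fsetxattr(fd, name, big.c_str(), big.size()), 0);
      ASSERT_GE(chain_fsetxattr(fd, name, small.c_str(), small.size()), 0);

      // a reader must see only the short value, not the stale tail of the
      // old chain
      char buf[128];
      int r = chain_fgetxattr(fd, name, buf, sizeof(buf));
      ASSERT_EQ((int)small.size(), r);
      ASSERT_EQ(0, memcmp(buf, small.c_str(), small.size()));

      ::close(fd);
      ::unlink(path);
    }

Repeating the same shape with a big value one byte short of, and one byte past, the alignment would cover the aligned/non-aligned combinations Yehuda mentions.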
Re: rest mgmt api
On Wed, Feb 6, 2013 at 12:14 PM, Sage Weil wrote: > On Wed, 6 Feb 2013, Dimitri Maziuk wrote: >> On 02/06/2013 01:34 PM, Sage Weil wrote: >> >> > I think the one caveat here is that having a single registry for commands >> > in the monitor means that commands can come in two flavors: vector >> > (cli) and URL (presumably in json form). But a single command >> > dispatch/registry framework will make that distinction pretty simple... >> >> Any reason you can't have your CLI json-encode the commands (or, >> conversely, your cgi/wsgi/php/servlet URL handler decode them into >> vector) before passing them on to the monitor? > > We can, but they won't necessarily look the same, because it is unlikely > we can make a sane 1:1 translation of the CLI to REST that makes sense, > and it would be nice to avoid baking knowledge about the individual > commands into the client side. I disagree and am with Joao on this one — the monitor parsing is ridiculous as it stand right now, and we should be trying to get rid of the manual string parsing. The monitors should be parsing JSON commands that are sent by the client; it makes validation and the logic control flow a lot easier. We're going to want some level of intelligence in the clients so that they can tailor themselves to the appropriate UI conventions, and having two different parsing paths in the monitors is just asking for trouble: they will get out of sync and have different kinds of parsing errors. What we could do is have the monitors speak JSON only, and then give the clients a minimal intelligence so that the CLI could (for instance) prettify the options for commands it knows about, but still allow pass-through for access to newer commands it hasn't yet heard of. -Greg -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
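To make the two "flavors" concrete, the same command might reach the monitor either as the argument vector the CLI effectively produces today, or as a JSON document. The JSON shape below is only illustrative; nothing in this thread has settled on it:

    # what the CLI effectively sends now (a vector<string>)
    ["osd", "pool", "create", "mypool", "128"]

    # one possible JSON encoding a REST endpoint (or a JSON-speaking CLI) could send
    { "command": "osd pool create", "args": { "pool": "mypool", "pg_num": 128 } }

Greg's point is that the monitor should only ever have to parse the second form, with any prettifying or pass-through of unknown commands handled on the client side.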
Re: File exists not handled in 0.48argonaut1
Anyone have any thoughts on this??? It looks like I may have to wipe out the OSDs effected and rebuild them, but I'm afraid that may result in data loss because of the old OSD first crush map in place :(. On Fri, Feb 8, 2013 at 1:36 PM, Mandell Degerness wrote: > We ran into an error which appears very much like a bug fixed in 0.44. > > This cluster is running version: > > ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c) > > The error line is: > > Feb 8 18:50:07 192.168.8.14 ceph-osd: 2013-02-08 18:50:07.545682 > 7f40f9f08700 0 filestore(/mnt/osd97) error (17) File exists not > handled on operation 20 (11279344.0.0, or op 0, counting from 0) > > A more complete log is attached. > > First question: is this a know bug fixed in more recent versions? > > Second question: is there any hope of recovery? -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 11:00:15AM -0600, Sam Lang wrote: > Hi Kevin, sorry for the delayed response. > This looks like the mds cache is thrashing quite a bit, and with > multiple MDSs the tree partitioning is causing those estale messages. > In your case, you should probably run with just a single active mds (I > assume all three MDSs are active, but ceph -s will tell you for sure), > and the others as standby. I don't think you'll be able to do that > without starting over though. Hi Sam, I know that MDS clustering is a bit buggy so I have only one active MDS on this cluster. Here is the output of ceph -s: ~ # ceph -s health HEALTH_OK monmap e1: 3 mons at {a=x:6789/0,b=y:6789/0,c=z:6789/0}, election epoch 48, quorum 0,1,2 a,b,c osdmap e79: 27 osds: 27 up, 27 in pgmap v895343: 5376 pgs: 5376 active+clean; 18987 MB data, 103 GB used, 21918 GB / 23201 GB avail mdsmap e73: 1/1/1 up {0=b=up:active}, 2 up:standby > Also, you might want to increase the size of the mds cache if you have > enough memory on that machine. mds cache size defaults to 100k, you > might increase it to 300k and see if you get the same problems. I have 24GB of memory for each MDS, I will try to increase this value. Thanks for advice. > Do you have debug logging enabled when you see this crash? Can you > compress that mds log and post it somewhere or email it to me? Yes, I have 34GB of raw logs (for this issue) but I have no debug log of the beginning of the storm itself. I will upload a compressed archive. Furthermore, I observe another strange thing more or less related to the storms. During a rsync command to write ~20G of data on Ceph and during (and after) the storm, one OSD sends a lot of data to the active MDS (400Mbps peak each 6 seconds). After a quick check, I found that when I stop osd.23, osd.14 stops its peaks. I will forward a copy of the debug enabled log of osd14. The only significant difference between osd.23 and others is the list of hb_in where osd.14 is missing (but I think it's unrelated). 
~ # ceph pg dump osdstat kbused kbavail kb hb in hb out 0 4016228 851255948 901042464 [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 1 4108748 851163428 901042464 [0,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26] [] 2 4276584 850995592 901042464 [0,1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 3 3997368 851274808 901042464 [0,1,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 4 4358212 850913964 901042464 [0,1,2,3,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 5 4039112 851233064 901042464 [0,1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 6 3971568 851300608 901042464 [0,1,2,3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 7 3942556 851329620 901042464 [0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 8 4275584 850996592 901042464 [0,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 9 4279308 850992868 901042464 [0,1,2,3,4,5,6,7,8,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 10 3728136 851544040 901042464 [0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 11 3934096 851338080 901042464 [0,1,2,3,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 12 3991600 851280576 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 13 4211228 851060948 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26] [] 14 4169476 851102700 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26] [] 15 4385584 850886592 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,16,17,18,19,20,21,22,23,24,25,26] [] 16 3761176 851511000 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26] [] 17 3646096 851626080 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18,19,20,21,22,23,24,25,26] [] 18 4119448 851152728 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23,24,25,26] [] 19 4592992 850679184 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,21,22,23,24,25,26] [] 20 3740840 851531336 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,23,24,25,26] [] 21 4363552 850908624 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,22,23,24,25,26] [] 22 3831420 851440756 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,23,24,25,26] [] 23 3681648 851590528 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,24,25,26] [] 24 3946192 851325984 901042464 [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,25,26] []
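For reference, the cache bump Sam suggests is a one-liner in ceph.conf on the MDS hosts (300000 is simply the value floated above, not a tuned recommendation), followed by an MDS restart:

    [mds]
        mds cache size = 300000

With 24 GB of RAM per MDS node there should be plenty of headroom for that, and the mdsmap line from ceph -s above already shows the desired single-active layout (1/1/1 up, 2 up:standby).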
Re: Crash and strange things on MDS
On Mon, Feb 11, 2013 at 7:05 AM, Kevin Decherf wrote: > On Mon, Feb 04, 2013 at 07:01:54PM +0100, Kevin Decherf wrote: >> Hey everyone, >> >> It's my first post here to expose a potential issue I found today using >> Ceph 0.56.1. >> >> The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS. >> All nodes are running Exherbo (source-based distribution) with Ceph >> 0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which >> is mounted on ~60 clients (increasing each day). Objects are replicated >> three times and the cluster handles only 7GB of data atm for 350k >> objects. >> >> In certain conditions (I don't know them atm), some clients hang, >> generate CPU overloads (kworker) and are unable to make any IO on >> Ceph. The active MDS have ~20Mbps in/out during the issue (less than >> 2Mbps in normal activity). I don't know if it's directly linked but we >> also observe a lot of missing files at the same time. >> >> The problem is similar to this one [1]. >> >> A restart of the client or the MDS was enough before today, but we found >> a new behavior: the active MDS consumes a lot of CPU during 3 to 5 hours >> with ~25% clients hanging. >> >> In logs I found a segfault with this backtrace [2] and 100,000 dumped >> events during the first hang. We observed another hang which produces >> lot of these events (in debug mode): >>- "mds.0.server FAIL on ESTALE but attempting recovery" >>- "mds.0.server reply_request -116 (Stale NFS file handle) >> client_request(client.10991:1031 getattr As #104bab0 >> RETRY=132)" Hi Kevin, sorry for the delayed response. This looks like the mds cache is thrashing quite a bit, and with multiple MDSs the tree partitioning is causing those estale messages. In your case, you should probably run with just a single active mds (I assume all three MDSs are active, but ceph -s will tell you for sure), and the others as standby. I don't think you'll be able to do that without starting over though. Also, you might want to increase the size of the mds cache if you have enough memory on that machine. mds cache size defaults to 100k, you might increase it to 300k and see if you get the same problems. >> >> We have no profiling tools available on these nodes, and I don't know >> what I should search in the 35 GB log file. >> >> Note: the segmentation fault occured only once but the problem was >> observed four times on this cluster. >> >> Any help may be appreciated. >> >> References: >> [1] http://www.spinics.net/lists/ceph-devel/msg04903.html >> [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) >> 1: /usr/bin/ceph-mds() [0x817e82] >> 2: (()+0xf140) [0x7f9091d30140] >> 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1] >> 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9] >> 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70] >> 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90] >> 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2] >> 8: (Server::kill_session(Session*)+0x137) [0x549c67] >> 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6] >> 10: (MDS::tick()+0x338) [0x4da928] >> 11: (SafeTimer::timer_thread()+0x1af) [0x78151f] >> 12: (SafeTimerThread::entry()+0xd) [0x782bad] >> 13: (()+0x7ddf) [0x7f9091d28ddf] >> 14: (clone()+0x6d) [0x7f90909cc24d] > > I found a possible cause/way to reproduce this issue. 
> We have now ~90 clients for 18GB / 650k objects and the storm occurs > when we execute an "intensive IO" command (tar of the whole pool / rsync > in one folder) on one of our client (the only which uses ceph-fuse, > don't know if it's limited to it or not). Do you have debug logging enabled when you see this crash? Can you compress that mds log and post it somewhere or email it to me? Thanks, -sam > > Any idea? > > Cheers, > -- > Kevin Decherf - @Kdecherf > GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F > http://kdecherf.com > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ceph-users] snapshot, clone and mount a VM-Image
On Mon, 11 Feb 2013, Wolfgang Hennerbichler wrote: > > > On 02/11/2013 03:02 PM, Wido den Hollander wrote: > > > You are looking at a way to "extract" the snapshot, correct? > > No. > > > Why would > > you want to mount it and backup the files? > > because then I can do things like incremental backups. There will be a > ceph cluster at an ISP soon, who hosts various services on various VMs, > and it is important that the mailspool for example is backed up > efficiently, because it's huge and the number of files is also high. Note that an alternative way to approach incremental backups is at the block device level. We plan to implement an incremental backup function for the relative change between two snapshots (or a snapshot and the head). It's O(n) the size of the device vs the number of files, but should be more efficient for all but the most sparse of images. The implementation should be simple; the challenge is mostly around the incremental file format, probably. That doesn't help you now, but would be a relatively self-contained piece of functionality for someone to contribute to RBD. This isn't a top priority yet, so it will be a while before the inktank devs can get to it. sage > > > Couldn't you better handle this in the Virtual Machine itself? > > not really. open, changing files, a lot of virtual machines that one > needs to take care of, and so on. > > > If you want to backup the virtual machines to an extern location you > > could use either "rbd" or "qemu-img" to get the snapshot out of the Ceph > > cluster: > > > > $ rbd export --snap > > > > Or use qemu-img > > > > $ qemu-img convert -f raw -O qcow2 -s rbd:rbd/ > > .qcow2 > > > > You then get files which you can backup externally. > > > > Would that work? > > sure, but this is a very inefficient way of backing things up, because > one would back up on block level. I want to back up on filesystem level. > > > Wido > > > >> thanks a lot for you answers > >> Wolfgang > >> > > > > > > > -- > DI (FH) Wolfgang Hennerbichler > Software Development > Unit Advanced Computing Technologies > RISC Software GmbH > A company of the Johannes Kepler University Linz > > IT-Center > Softwarepark 35 > 4232 Hagenberg > Austria > > Phone: +43 7236 3343 245 > Fax: +43 7236 3343 250 > wolfgang.hennerbich...@risc-software.at > http://www.risc-software.at > ___ > ceph-users mailing list > ceph-us...@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
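To make the block-level idea concrete, here is a toy sketch that diffs two full "rbd export" images of consecutive snapshots and reports the changed extents. It only illustrates why the cost is O(size of the device) rather than O(number of files); it is not the planned RBD feature, and the file names and chunk size are made up:

    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main() {
      const size_t chunk = 4 * 1024 * 1024;               // compare 4 MB at a time
      std::FILE *a = std::fopen("vm-snap1.img", "rb");    // hypothetical exports
      std::FILE *b = std::fopen("vm-snap2.img", "rb");
      if (!a || !b)
        return 1;
      std::vector<char> ba(chunk), bb(chunk);
      unsigned long long off = 0;
      while (true) {
        size_t ra = std::fread(&ba[0], 1, chunk, a);
        size_t rb = std::fread(&bb[0], 1, chunk, b);
        if (ra == 0 && rb == 0)
          break;
        if (ra != rb || std::memcmp(&ba[0], &bb[0], ra) != 0)
          // a real tool would append an (offset, length, data) record to the
          // incremental file here; that file format is the open question
          std::printf("changed extent at offset %llu\n", off);
        off += chunk;
        if (ra < chunk || rb < chunk)
          break;
      }
      std::fclose(a);
      std::fclose(b);
      return 0;
    }

The part Sage flags as the real work -- a sane on-disk format for those (offset, length, data) records -- is exactly what this toy punts on.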
Re: IPv6 address confusion in OSDs
On Mon, 11 Feb 2013, Simon Leinen wrote: > Sage Weil writes: > > On Mon, 11 Feb 2013, Simon Leinen wrote: > >> We run a ten-node 64-OSD Ceph cluster and use IPv6 where possible. > > I should have mentioned that this is under Ubuntu 12.10 with version > 0.56.1-1quantal of the ceph packages. Sorry about the omission. > > >> Today I noticed this error message from an OSD just after I restarted > >> it (in an attempt to resolve an issue with some "stuck" pgs that > >> included that OSD): > >> > >> 2013-02-11 09:24:57.232811 osd.35 [ERR] map e768 had wrong cluster addr > >> ([2001:620:0:6::106]:6822/1990 != my > >> [fe80::67d:7bff:fef1:78b%vlan301]:6822/1990) > >> > >> These two addresses belong to the same interface: > >> > >> root@h1:~# ip -6 addr list dev vlan301 > >> 7: vlan301@bond0: mtu 1500 > >> inet6 2001:620:0:6::106/64 scope global > >> valid_lft forever preferred_lft forever > >> inet6 fe80::67d:7bff:fef1:78b/64 scope link > >> valid_lft forever preferred_lft forever > >> > >> 2001:620:... is the global-scope address, and this is how OSDs are > >> addressed in our ceph.conf. fe80:... is the link-local address that > >> every IPv6 interface has. Shouldn't these be treated as equivalent? > > > Is this OSD by chance sharing a host with one of the monitors? > > Yes, indeed! We have five monitors, i.e. every other server runs a > ceph-mon in addition to the 4-9 ceph-osd processes each server has. > This (h1) is one of the servers that has both. > > > The 'my address' value is learned by looking at the socket we connect to > > the monitor with... > > Thanks for the hint! I'll look at the code and try to understand > what's happening and how this could be avoided. > > The cluster seems to have recovered from this particular error by > itself. That makes sense if the trigger here is it random choosely to connect to the local monitor first and learning the address that way. Adding 'debug ms = 20' to your ceph.conf may give a hint.. looked for a 'learned by addr' message (or somethign similar) right at startup time. > But in general, when I reboot servers, there's often some pgs > that remain stuck, and I have to restart some OSDs until ceph -w shows > everything as "active+clean". Note that 'ceph osd down NN' may give similar results as restarting the daemon. > (Our network setup is somewhat complex, with IPv6 over VLANs over > "bonded" 10GEs redundantly connected to a pair of Brocade switches > running VLAG (something like multi-chassis Etherchannel). So it's > possible that there are some connectivity issues hiding somewhere.) Let us know what you find! sage > -- > Simon. > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
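Until the root cause is pinned down, it may be worth both turning on the messenger debugging Sage mentions and explicitly pinning the daemon's addresses to the global-scope address in ceph.conf, so a learned link-local address can never win. A sketch only -- the option names are standard, but whether pinning actually avoids this particular confusion is unverified:

    [osd.35]
        host = h1
        public addr = 2001:620:0:6::106
        cluster addr = 2001:620:0:6::106
        # per Sage: watch the startup log for the "learned ... addr" message
        debug ms = 20

And for the pgs that stay stuck after a reboot, "ceph osd down 35" is the lighter-weight alternative Sage mentions to restarting the daemon.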
Re: [PATCH 01/15] kv_flat_btree_async.cc: use vector instead of VLA's
On Mon, 11 Feb 2013, Danny Al-Gaaf wrote: > Am 10.02.2013 06:57, schrieb Sage Weil: > > On Thu, 7 Feb 2013, Danny Al-Gaaf wrote: > >> Fix "variable length array of non-POD element type" errors caused by > >> using librados::ObjectWriteOperation VLAs. (-Wvla) > >> > >> Signed-off-by: Danny Al-Gaaf > >> --- > >> src/key_value_store/kv_flat_btree_async.cc | 14 +++--- > >> 1 file changed, 7 insertions(+), 7 deletions(-) > >> > >> diff --git a/src/key_value_store/kv_flat_btree_async.cc > >> b/src/key_value_store/kv_flat_btree_async.cc > >> index 96c6cb0..4342e70 100644 > >> --- a/src/key_value_store/kv_flat_btree_async.cc > >> +++ b/src/key_value_store/kv_flat_btree_async.cc > >> @@ -1119,9 +1119,9 @@ int KvFlatBtreeAsync::cleanup(const index_data > >> &idata, const int &errno) { > >> //all changes were created except for updating the index and possibly > >> //deleting the objects. roll forward. > >> vector, librados::ObjectWriteOperation*> > ops; > >> -librados::ObjectWriteOperation owos[idata.to_delete.size() + 1]; > >> +vector owos(idata.to_delete.size() + > >> 1); > > > > I haven't read much of the surrounding code, but from what is included > > here I don't think this is equivalent... these are just null pointers > > initially, and so > > > >> for (int i = 0; i <= (int)idata.to_delete.size(); ++i) { > >> - ops.push_back(make_pair(pair(0, ""), &owos[i])); > >> + ops.push_back(make_pair(pair(0, ""), owos[i])); > > > > this doesn't do anything useful... owos[i] may as well be NULL. Why not > > make it > > > > vector owos(...) > > > > ? > > Because this would lead to a linker error: > > kv_flat_btree_async.o: In function `void > std::__uninitialized_fill_n::__uninit_fill_n unsigned long, > librados::ObjectWriteOperation>(librados::ObjectWriteOperation*, > unsigned long, librados::ObjectWriteOperation const&)': > /usr/bin/../lib64/gcc/x86_64-suse-linux/4.7/../../../../include/c++/4.7/bits/stl_uninitialized.h:188: > undefined reference to > `librados::ObjectOperation::ObjectOperation(librados::ObjectOperation > const&)' > /usr/bin/../lib64/gcc/x86_64-suse-linux/4.7/../../../../include/c++/4.7/bits/stl_uninitialized.h:188: > undefined reference to > `librados::ObjectOperation::ObjectOperation(librados::ObjectOperation > const&)' > > > Because in src/include/rados/librados.hpp > librados::ObjectOperation::ObjectOperation(librados::ObjectOperation > const&) was is defined, but not implemented in the librados.cc. > > Not sure if removing ObjectOperation(librados::ObjectOperation const&) > is the way to go here. Oh, I see... yeah, we shouldn't remove that. Probably we should restructure the code to use a list<>, which doesn't require a copy constructor or assignment operator. Note that this particular code shouldn't hold up the rest of the patches, since it's not being used by anything (yet!). sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
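One caveat on the list<> idea: before C++11, populating any standard container still formally goes through the element's copy constructor, so the simplest way to avoid touching librados::ObjectOperation's declared-but-undefined copy constructor may be to keep the operations behind pointers. A sketch of that variant (not the change that was merged, and the inner pair<int, string> is a guess -- the template arguments were eaten by the mail archive):

    vector<librados::ObjectWriteOperation*> owos;   // owns the operations
    vector<pair<pair<int, string>, librados::ObjectWriteOperation*> > ops;
    for (int i = 0; i <= (int)idata.to_delete.size(); ++i) {
      owos.push_back(new librados::ObjectWriteOperation());
      ops.push_back(make_pair(pair<int, string>(0, ""), owos.back()));
    }
    // ... existing code that fills in and submits ops ...
    for (size_t i = 0; i < owos.size(); ++i)
      delete owos[i];

With a C++11 toolchain, std::list<librados::ObjectWriteOperation> owos(idata.to_delete.size() + 1) would value-initialize the elements in place and work as Sage describes, since list never relocates its elements.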
Re: IPv6 address confusion in OSDs
Sage Weil writes: > On Mon, 11 Feb 2013, Simon Leinen wrote: >> We run a ten-node 64-OSD Ceph cluster and use IPv6 where possible. I should have mentioned that this is under Ubuntu 12.10 with version 0.56.1-1quantal of the ceph packages. Sorry about the omission. >> Today I noticed this error message from an OSD just after I restarted >> it (in an attempt to resolve an issue with some "stuck" pgs that >> included that OSD): >> >> 2013-02-11 09:24:57.232811 osd.35 [ERR] map e768 had wrong cluster addr >> ([2001:620:0:6::106]:6822/1990 != my >> [fe80::67d:7bff:fef1:78b%vlan301]:6822/1990) >> >> These two addresses belong to the same interface: >> >> root@h1:~# ip -6 addr list dev vlan301 >> 7: vlan301@bond0: mtu 1500 >> inet6 2001:620:0:6::106/64 scope global >> valid_lft forever preferred_lft forever >> inet6 fe80::67d:7bff:fef1:78b/64 scope link >> valid_lft forever preferred_lft forever >> >> 2001:620:... is the global-scope address, and this is how OSDs are >> addressed in our ceph.conf. fe80:... is the link-local address that >> every IPv6 interface has. Shouldn't these be treated as equivalent? > Is this OSD by chance sharing a host with one of the monitors? Yes, indeed! We have five monitors, i.e. every other server runs a ceph-mon in addition to the 4-9 ceph-osd processes each server has. This (h1) is one of the servers that has both. > The 'my address' value is learned by looking at the socket we connect to > the monitor with... Thanks for the hint! I'll look at the code and try to understand what's happening and how this could be avoided. The cluster seems to have recovered from this particular error by itself. But in general, when I reboot servers, there's often some pgs that remain stuck, and I have to restart some OSDs until ceph -w shows everything as "active+clean". (Our network setup is somewhat complex, with IPv6 over VLANs over "bonded" 10GEs redundantly connected to a pair of Brocade switches running VLAG (something like multi-chassis Etherchannel). So it's possible that there are some connectivity issues hiding somewhere.) -- Simon. -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Client can't reboot when rbd volume is mounted.
On Mon, 11 Feb 2013, Roman Alekseev wrote: > On 11.02.2013 09:36, Sage Weil wrote: > > On Mon, 11 Feb 2013, Roman Alekseev wrote: > > > Hi, > > > > > > When I try to reboot a client server without unmounting of rbd volume > > > manually > > > its services stop working but server doesn't reboot completely and show > > > the > > > following logs in KVM console: > > > > > > [235618.0202207] libceph: connect 192.168.0.19:6789 error -101 > > That is > > > > #defineENETUNREACH 101 /* Network is unreachable */ > > > > Note that that (or any other) socket error is not necessarily fatal; the > > kernel client will retry and eventually connect to that or another OSD > > to complete the IO. Are you observing that the RBD image hangs or > > something? > > > > You can peek at in-flight IO (and other state) with > > > > cat /sys/kernel/debug/ceph/*/osdc > > > > unmount/unmap should not be necessarily in any case unless there is a bug. > > We backported a bunch of stuff to 3.6.6, so 3.6.10 ought to be okay. You > > might try a newer 3.6.x kernel too; I forget if there was a second batch > > of fixes.. > > > > sage > > Hi Sage, > > > #define ENETUNREACH 101 /* Network is unreachable */ > > The reason of this error is that networking stop working after performing > server reset request. > > > Are you observing that the RBD image hangs or something? > > the RBD works properly. It is just mapped and mounted on the client server. > > # /dev/rbd1 99G 616M 93G 1% /home/test I think I'm confused about what you mean by 'server'. Do you mean the host that rbd is mapped on, or the host(s) where the ceph-osd's are running? By 'the RBD works properly' do you mean the client where it is mapped? In which case, what exactly is the problem? > The "/sys/kernel/debug" folder is empty, how to put 'ceph/*/osdc' content into > it? 'mount -t debugfs none /sys/kernel/debug' and it will appear (along with other fun stuff)... sage > > I've update kernel to 3.7.4 version but problem is still persist. > > Thanks > > -- > Kind regards, > > R. Alekseev > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
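Until the shutdown-ordering question is sorted out, the manual workaround Roman describes can at least be scripted so the unmount/unmap happens before networking is torn down. A sketch only -- the init-system wiring is distribution-specific, and the mount point and device are taken from Roman's df output:

    #!/bin/sh
    # e.g. /etc/init.d/rbd-unmap, ordered to stop before the network does
    umount /home/test
    rbd unmap /dev/rbd1

If a reboot does wedge again, the in-flight requests can be inspected first with the debugfs mount Sage describes (mount -t debugfs none /sys/kernel/debug, then cat /sys/kernel/debug/ceph/*/osdc).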
Re: IPv6 address confusion in OSDs
On Mon, 11 Feb 2013, Simon Leinen wrote: > We run a ten-node 64-OSD Ceph cluster and use IPv6 where possible. > > Today I noticed this error message from an OSD just after I restarted > it (in an attempt to resolve an issue with some "stuck" pgs that > included that OSD): > > 2013-02-11 09:24:57.232811 osd.35 [ERR] map e768 had wrong cluster addr > ([2001:620:0:6::106]:6822/1990 != my > [fe80::67d:7bff:fef1:78b%vlan301]:6822/1990) > > These two addresses belong to the same interface: > > root@h1:~# ip -6 addr list dev vlan301 > 7: vlan301@bond0: mtu 1500 > inet6 2001:620:0:6::106/64 scope global >valid_lft forever preferred_lft forever > inet6 fe80::67d:7bff:fef1:78b/64 scope link >valid_lft forever preferred_lft forever > > 2001:620:... is the global-scope address, and this is how OSDs are > addressed in our ceph.conf. fe80:... is the link-local address that > every IPv6 interface has. Shouldn't these be treated as equivalent? Is this OSD by chance sharing a host with one of the monitors? The 'my address' value is learned by looking at the socket we connect to the monitor with... sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type
Acked-by: Sage Weil On Mon, 11 Feb 2013, Namjae Jeon wrote: > From: Namjae Jeon > > This patch is a follow up on below patch: > > [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type > commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 > > Signed-off-by: Namjae Jeon > Signed-off-by: Vivek Trivedi > Acked-by: Steven Whitehouse > --- > fs/btrfs/export.c |4 ++-- > fs/ceph/export.c|4 ++-- > fs/fuse/inode.c |2 +- > fs/gfs2/export.c|4 ++-- > fs/isofs/export.c |4 ++-- > fs/nilfs2/namei.c |4 ++-- > fs/ocfs2/export.c |4 ++-- > fs/reiserfs/inode.c |4 ++-- > fs/udf/namei.c |4 ++-- > fs/xfs/xfs_export.c |4 ++-- > mm/cleancache.c |2 +- > mm/shmem.c |2 +- > 12 files changed, 21 insertions(+), 21 deletions(-) > > diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c > index 614f34a..81ee29e 100644 > --- a/fs/btrfs/export.c > +++ b/fs/btrfs/export.c > @@ -22,10 +22,10 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, > int *max_len, > > if (parent && (len < BTRFS_FID_SIZE_CONNECTABLE)) { > *max_len = BTRFS_FID_SIZE_CONNECTABLE; > - return 255; > + return FILEID_INVALID; > } else if (len < BTRFS_FID_SIZE_NON_CONNECTABLE) { > *max_len = BTRFS_FID_SIZE_NON_CONNECTABLE; > - return 255; > + return FILEID_INVALID; > } > > len = BTRFS_FID_SIZE_NON_CONNECTABLE; > diff --git a/fs/ceph/export.c b/fs/ceph/export.c > index ca3ab3f..16796be 100644 > --- a/fs/ceph/export.c > +++ b/fs/ceph/export.c > @@ -81,7 +81,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, > int *max_len, > if (parent_inode) { > /* nfsd wants connectable */ > *max_len = connected_handle_length; > - type = 255; > + type = FILEID_INVALID; > } else { > dout("encode_fh %p\n", dentry); > fh->ino = ceph_ino(inode); > @@ -90,7 +90,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, > int *max_len, > } > } else { > *max_len = handle_length; > - type = 255; > + type = FILEID_INVALID; > } > if (dentry) > dput(dentry); > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c > index 9876a87..973e8f0 100644 > --- a/fs/fuse/inode.c > +++ b/fs/fuse/inode.c > @@ -679,7 +679,7 @@ static int fuse_encode_fh(struct inode *inode, u32 *fh, > int *max_len, > > if (*max_len < len) { > *max_len = len; > - return 255; > + return FILEID_INVALID; > } > > nodeid = get_fuse_inode(inode)->nodeid; > diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c > index 4767774..9973df4 100644 > --- a/fs/gfs2/export.c > +++ b/fs/gfs2/export.c > @@ -37,10 +37,10 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, > int *len, > > if (parent && (*len < GFS2_LARGE_FH_SIZE)) { > *len = GFS2_LARGE_FH_SIZE; > - return 255; > + return FILEID_INVALID; > } else if (*len < GFS2_SMALL_FH_SIZE) { > *len = GFS2_SMALL_FH_SIZE; > - return 255; > + return FILEID_INVALID; > } > > fh[0] = cpu_to_be32(ip->i_no_formal_ino >> 32); > diff --git a/fs/isofs/export.c b/fs/isofs/export.c > index 2b4f235..12088d8 100644 > --- a/fs/isofs/export.c > +++ b/fs/isofs/export.c > @@ -125,10 +125,10 @@ isofs_export_encode_fh(struct inode *inode, >*/ > if (parent && (len < 5)) { > *max_len = 5; > - return 255; > + return FILEID_INVALID; > } else if (len < 3) { > *max_len = 3; > - return 255; > + return FILEID_INVALID; > } > > len = 3; > diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c > index 1d0c0b8..9de78f0 100644 > --- a/fs/nilfs2/namei.c > +++ b/fs/nilfs2/namei.c > @@ -517,11 +517,11 @@ static int nilfs_encode_fh(struct inode *inode, __u32 > *fh, int *lenp, > > if (parent && *lenp < NILFS_FID_SIZE_CONNECTABLE) { > *lenp = NILFS_FID_SIZE_CONNECTABLE; > - return 255; > + return 
FILEID_INVALID; > } > if (*lenp < NILFS_FID_SIZE_NON_CONNECTABLE) { > *lenp = NILFS_FID_SIZE_NON_CONNECTABLE; > - return 255; > + return FILEID_INVALID; > } > > fid->cno = root->cno; > diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c > index 322216a..2965116 100644 > --- a/fs/ocfs2/export.c > +++ b/fs/ocfs2/export.c > @@ -195,11 +195,11 @@ static int ocfs2_encode_fh(struct inode *inode, u32 > *fh_in, int *max_len, > > if (parent && (len < 6)) { > *max_len = 6; > - type = 255; > + type = FILEID_INVALID; > goto bail; > } else i
Re: [PATCH 01/15] kv_flat_btree_async.cc: use vector instead of VLA's
Am 10.02.2013 06:57, schrieb Sage Weil: > On Thu, 7 Feb 2013, Danny Al-Gaaf wrote: >> Fix "variable length array of non-POD element type" errors caused by >> using librados::ObjectWriteOperation VLAs. (-Wvla) >> >> Signed-off-by: Danny Al-Gaaf >> --- >> src/key_value_store/kv_flat_btree_async.cc | 14 +++--- >> 1 file changed, 7 insertions(+), 7 deletions(-) >> >> diff --git a/src/key_value_store/kv_flat_btree_async.cc >> b/src/key_value_store/kv_flat_btree_async.cc >> index 96c6cb0..4342e70 100644 >> --- a/src/key_value_store/kv_flat_btree_async.cc >> +++ b/src/key_value_store/kv_flat_btree_async.cc >> @@ -1119,9 +1119,9 @@ int KvFlatBtreeAsync::cleanup(const index_data &idata, >> const int &errno) { >> //all changes were created except for updating the index and possibly >> //deleting the objects. roll forward. >> vector, librados::ObjectWriteOperation*> > ops; >> -librados::ObjectWriteOperation owos[idata.to_delete.size() + 1]; >> +vector owos(idata.to_delete.size() + >> 1); > > I haven't read much of the surrounding code, but from what is included > here I don't think this is equivalent... these are just null pointers > initially, and so > >> for (int i = 0; i <= (int)idata.to_delete.size(); ++i) { >> - ops.push_back(make_pair(pair(0, ""), &owos[i])); >> + ops.push_back(make_pair(pair(0, ""), owos[i])); > > this doesn't do anything useful... owos[i] may as well be NULL. Why not > make it > > vector owos(...) > > ? Because this would lead to a linker error: kv_flat_btree_async.o: In function `void std::__uninitialized_fill_n::__uninit_fill_n(librados::ObjectWriteOperation*, unsigned long, librados::ObjectWriteOperation const&)': /usr/bin/../lib64/gcc/x86_64-suse-linux/4.7/../../../../include/c++/4.7/bits/stl_uninitialized.h:188: undefined reference to `librados::ObjectOperation::ObjectOperation(librados::ObjectOperation const&)' /usr/bin/../lib64/gcc/x86_64-suse-linux/4.7/../../../../include/c++/4.7/bits/stl_uninitialized.h:188: undefined reference to `librados::ObjectOperation::ObjectOperation(librados::ObjectOperation const&)' Because in src/include/rados/librados.hpp librados::ObjectOperation::ObjectOperation(librados::ObjectOperation const&) was is defined, but not implemented in the librados.cc. Not sure if removing ObjectOperation(librados::ObjectOperation const&) is the way to go here. Danny -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Crash and strange things on MDS
On Mon, Feb 04, 2013 at 07:01:54PM +0100, Kevin Decherf wrote: > Hey everyone, > > It's my first post here to expose a potential issue I found today using > Ceph 0.56.1. > > The cluster configuration is, briefly: 27 osd of ~900GB and 3 MON/MDS. > All nodes are running Exherbo (source-based distribution) with Ceph > 0.56.1 and Linux 3.7.0. We are only using CephFS on this cluster which > is mounted on ~60 clients (increasing each day). Objects are replicated > three times and the cluster handles only 7GB of data atm for 350k > objects. > > In certain conditions (I don't know them atm), some clients hang, > generate CPU overloads (kworker) and are unable to make any IO on > Ceph. The active MDS have ~20Mbps in/out during the issue (less than > 2Mbps in normal activity). I don't know if it's directly linked but we > also observe a lot of missing files at the same time. > > The problem is similar to this one [1]. > > A restart of the client or the MDS was enough before today, but we found > a new behavior: the active MDS consumes a lot of CPU during 3 to 5 hours > with ~25% clients hanging. > > In logs I found a segfault with this backtrace [2] and 100,000 dumped > events during the first hang. We observed another hang which produces > lot of these events (in debug mode): >- "mds.0.server FAIL on ESTALE but attempting recovery" >- "mds.0.server reply_request -116 (Stale NFS file handle) > client_request(client.10991:1031 getattr As #104bab0 > RETRY=132)" > > We have no profiling tools available on these nodes, and I don't know > what I should search in the 35 GB log file. > > Note: the segmentation fault occured only once but the problem was > observed four times on this cluster. > > Any help may be appreciated. > > References: > [1] http://www.spinics.net/lists/ceph-devel/msg04903.html > [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) > 1: /usr/bin/ceph-mds() [0x817e82] > 2: (()+0xf140) [0x7f9091d30140] > 3: (MDCache::request_drop_foreign_locks(MDRequest*)+0x21) [0x5b9dc1] > 4: (MDCache::request_drop_locks(MDRequest*)+0x19) [0x5baae9] > 5: (MDCache::request_cleanup(MDRequest*)+0x60) [0x5bab70] > 6: (MDCache::request_kill(MDRequest*)+0x80) [0x5bae90] > 7: (Server::journal_close_session(Session*, int)+0x372) [0x549aa2] > 8: (Server::kill_session(Session*)+0x137) [0x549c67] > 9: (Server::find_idle_sessions()+0x12a6) [0x54b0d6] > 10: (MDS::tick()+0x338) [0x4da928] > 11: (SafeTimer::timer_thread()+0x1af) [0x78151f] > 12: (SafeTimerThread::entry()+0xd) [0x782bad] > 13: (()+0x7ddf) [0x7f9091d28ddf] > 14: (clone()+0x6d) [0x7f90909cc24d] I found a possible cause/way to reproduce this issue. We have now ~90 clients for 18GB / 650k objects and the storm occurs when we execute an "intensive IO" command (tar of the whole pool / rsync in one folder) on one of our client (the only which uses ceph-fuse, don't know if it's limited to it or not). Any idea? Cheers, -- Kevin Decherf - @Kdecherf GPG C610 FE73 E706 F968 612B E4B2 108A BD75 A81E 6E2F http://kdecherf.com -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Client can't reboot when rbd volume is mounted.
On 11.02.2013 09:36, Sage Weil wrote: On Mon, 11 Feb 2013, Roman Alekseev wrote: Hi, When I try to reboot a client server without unmounting the rbd volume manually, its services stop working but the server doesn't reboot completely and shows the following logs in the KVM console: [235618.0202207] libceph: connect 192.168.0.19:6789 error -101 That is #define ENETUNREACH 101 /* Network is unreachable */ Note that that (or any other) socket error is not necessarily fatal; the kernel client will retry and eventually connect to that or another OSD to complete the IO. Are you observing that the RBD image hangs or something? You can peek at in-flight IO (and other state) with cat /sys/kernel/debug/ceph/*/osdc unmount/unmap should not be necessary in any case unless there is a bug. We backported a bunch of stuff to 3.6.6, so 3.6.10 ought to be okay. You might try a newer 3.6.x kernel too; I forget if there was a second batch of fixes.. sage

Hi Sage,

> #define ENETUNREACH 101 /* Network is unreachable */

The reason for this error is that networking stops working after the server reset request is issued.

> Are you observing that the RBD image hangs or something?

The RBD works properly. It is just mapped and mounted on the client server.

# /dev/rbd1 99G 616M 93G 1% /home/test

The "/sys/kernel/debug" folder is empty; how do I get the 'ceph/*/osdc' content to appear in it?

I've updated the kernel to version 3.7.4 but the problem still persists.

Thanks -- Kind regards, R. Alekseev -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
IPv6 address confusion in OSDs
We run a ten-node 64-OSD Ceph cluster and use IPv6 where possible. Today I noticed this error message from an OSD just after I restarted it (in an attempt to resolve an issue with some "stuck" pgs that included that OSD): 2013-02-11 09:24:57.232811 osd.35 [ERR] map e768 had wrong cluster addr ([2001:620:0:6::106]:6822/1990 != my [fe80::67d:7bff:fef1:78b%vlan301]:6822/1990) These two addresses belong to the same interface: root@h1:~# ip -6 addr list dev vlan301 7: vlan301@bond0: mtu 1500 inet6 2001:620:0:6::106/64 scope global valid_lft forever preferred_lft forever inet6 fe80::67d:7bff:fef1:78b/64 scope link valid_lft forever preferred_lft forever 2001:620:... is the global-scope address, and this is how OSDs are addressed in our ceph.conf. fe80:... is the link-local address that every IPv6 interface has. Shouldn't these be treated as equivalent? -- Simon. -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] fs: encode_fh: return FILEID_INVALID if invalid fid_type
From: Namjae Jeon This patch is a follow up on below patch: [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type commit: 216b6cbdcbd86b1db0754d58886b466ae31f5a63 Signed-off-by: Namjae Jeon Signed-off-by: Vivek Trivedi Acked-by: Steven Whitehouse --- fs/btrfs/export.c |4 ++-- fs/ceph/export.c|4 ++-- fs/fuse/inode.c |2 +- fs/gfs2/export.c|4 ++-- fs/isofs/export.c |4 ++-- fs/nilfs2/namei.c |4 ++-- fs/ocfs2/export.c |4 ++-- fs/reiserfs/inode.c |4 ++-- fs/udf/namei.c |4 ++-- fs/xfs/xfs_export.c |4 ++-- mm/cleancache.c |2 +- mm/shmem.c |2 +- 12 files changed, 21 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c index 614f34a..81ee29e 100644 --- a/fs/btrfs/export.c +++ b/fs/btrfs/export.c @@ -22,10 +22,10 @@ static int btrfs_encode_fh(struct inode *inode, u32 *fh, int *max_len, if (parent && (len < BTRFS_FID_SIZE_CONNECTABLE)) { *max_len = BTRFS_FID_SIZE_CONNECTABLE; - return 255; + return FILEID_INVALID; } else if (len < BTRFS_FID_SIZE_NON_CONNECTABLE) { *max_len = BTRFS_FID_SIZE_NON_CONNECTABLE; - return 255; + return FILEID_INVALID; } len = BTRFS_FID_SIZE_NON_CONNECTABLE; diff --git a/fs/ceph/export.c b/fs/ceph/export.c index ca3ab3f..16796be 100644 --- a/fs/ceph/export.c +++ b/fs/ceph/export.c @@ -81,7 +81,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, int *max_len, if (parent_inode) { /* nfsd wants connectable */ *max_len = connected_handle_length; - type = 255; + type = FILEID_INVALID; } else { dout("encode_fh %p\n", dentry); fh->ino = ceph_ino(inode); @@ -90,7 +90,7 @@ static int ceph_encode_fh(struct inode *inode, u32 *rawfh, int *max_len, } } else { *max_len = handle_length; - type = 255; + type = FILEID_INVALID; } if (dentry) dput(dentry); diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 9876a87..973e8f0 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -679,7 +679,7 @@ static int fuse_encode_fh(struct inode *inode, u32 *fh, int *max_len, if (*max_len < len) { *max_len = len; - return 255; + return FILEID_INVALID; } nodeid = get_fuse_inode(inode)->nodeid; diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c index 4767774..9973df4 100644 --- a/fs/gfs2/export.c +++ b/fs/gfs2/export.c @@ -37,10 +37,10 @@ static int gfs2_encode_fh(struct inode *inode, __u32 *p, int *len, if (parent && (*len < GFS2_LARGE_FH_SIZE)) { *len = GFS2_LARGE_FH_SIZE; - return 255; + return FILEID_INVALID; } else if (*len < GFS2_SMALL_FH_SIZE) { *len = GFS2_SMALL_FH_SIZE; - return 255; + return FILEID_INVALID; } fh[0] = cpu_to_be32(ip->i_no_formal_ino >> 32); diff --git a/fs/isofs/export.c b/fs/isofs/export.c index 2b4f235..12088d8 100644 --- a/fs/isofs/export.c +++ b/fs/isofs/export.c @@ -125,10 +125,10 @@ isofs_export_encode_fh(struct inode *inode, */ if (parent && (len < 5)) { *max_len = 5; - return 255; + return FILEID_INVALID; } else if (len < 3) { *max_len = 3; - return 255; + return FILEID_INVALID; } len = 3; diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c index 1d0c0b8..9de78f0 100644 --- a/fs/nilfs2/namei.c +++ b/fs/nilfs2/namei.c @@ -517,11 +517,11 @@ static int nilfs_encode_fh(struct inode *inode, __u32 *fh, int *lenp, if (parent && *lenp < NILFS_FID_SIZE_CONNECTABLE) { *lenp = NILFS_FID_SIZE_CONNECTABLE; - return 255; + return FILEID_INVALID; } if (*lenp < NILFS_FID_SIZE_NON_CONNECTABLE) { *lenp = NILFS_FID_SIZE_NON_CONNECTABLE; - return 255; + return FILEID_INVALID; } fid->cno = root->cno; diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c index 322216a..2965116 100644 --- a/fs/ocfs2/export.c +++ b/fs/ocfs2/export.c @@ -195,11 +195,11 
@@ static int ocfs2_encode_fh(struct inode *inode, u32 *fh_in, int *max_len, if (parent && (len < 6)) { *max_len = 6; - type = 255; + type = FILEID_INVALID; goto bail; } else if (len < 3) { *max_len = 3; - type = 255; + type = FILEID_INVALID; goto bail; } diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c index 30195bc.