Re: [OpenAFS] replacement for depot?
On 2012-09-26 at 18:20, Jason Edgecombe ( ja...@rampaginggeek.com ) said: Hi everyone, I'm using a program called "depot", which I think used to be included with IBM/Transarc AFS. I'm planning to migrate from RHEL5 with cfengine to RHEL6 with puppet. I use depot to manage many folders of symlinks. What would you recommend as a replacement for depot? FYI, the fsi_generate command is used to generate a depot.image file that the depot command uses to build the symlink farm.

I've used GNU stow with some wrapper scripts for volume creation, replication, mounting, permissions, etc., to manage software installation into AFS. --andy
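For illustration, a minimal sketch of what such a stow-based install might look like (the cell, server, partition, volume, and path names below are hypothetical, not the actual wrapper scripts mentioned above):

  # create, mount, and open up a volume for one package
  vos create fs1.example.com a sw.foo-1.2 -maxquota 2000000
  fs mkmount /afs/.example.com/sw/foo-1.2 sw.foo-1.2
  fs setacl /afs/.example.com/sw/foo-1.2 system:anyuser rl
  # install into the volume, then stow it into a shared bin/lib tree
  ./configure --prefix=/afs/.example.com/sw/foo-1.2 && make && make install
  cd /afs/.example.com/sw && stow -t /afs/.example.com/sw/common foo-1.2
  # replicate and release so clients see the read-only copy
  vos addsite fs1.example.com a sw.foo-1.2 && vos release sw.foo-1.2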
Re: [OpenAFS] sysname for 3.x linux kernel
On 2012-03-08 at 15:49, Dave Botsch ( bot...@cnf.cornell.edu ) said: I just set the sysname to whatever I want it to be. Cfengine sticks a "/usr/bin/fs sysname -newsys" command in the /etc/init.d/openafs-client script in the start() section.

Interesting. What are you setting the sysname to in that case? Anything cfengine-specific? --andy
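As a concrete illustration, the line cfengine drops into the client init script would look something like this; the sysname value here is made up, since the whole question is what value people actually use:

  # in start(), after afsd is running
  /usr/bin/fs sysname -newsys amd64_linux26_rhel6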
Re: [OpenAFS] Feature wish: remove partition while fileserver keeps on running
On 2012-02-27 at 17:00, Lars Schimmer ( l.schim...@cgv.tugraz.at ) said: Hi! Maybe I missed a point or two, but I wish I could remove and unmount a /vicepX partition while the fileserver keeps on running. The last weeks I needed to redo our iSCSI storage, and that implied a lot of mount/unmount/redo partitions on our OpenAFS fileservers. Each time I needed to add/remove a partition, the safe way was to stop the OpenAFS fileserver, mount/umount the partition, and restart the fileserver. As I do not want to be the night owl, I did it during the usual work shift - which annoyed our users, as service was broken for a few minutes. Is there a better way to do this? (IMHO DAFS is only for volumes, not partitions, or?)

That is correct. However, DAFS can make the current methods of adding/removing /vicep's a bit less painful. In the past, what I've typically done to remove partitions is completely evacuate them with vos remove/move etc., unmount the partition, then bos restart dafs. Unmounting a partition while a fileserver might still be accessing it is risky, but if you're absolutely sure that there's nothing left on it, then this is more or less safe IMO. You could also bos shutdown / unmount / bos startup if you want to be paranoid. Adding partitions is easier: mount the new partition, then restart. With DAFS, restart times are extremely fast, and I believe callback state is preserved across restarts, so your clients shouldn't notice the restart if everything is working correctly. --andy
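A rough sketch of that removal procedure, with placeholder server, partition, and volume names (these are not commands taken from the message above):

  # evacuate the partition being retired
  vos listvol fs1.example.com /vicepb
  vos move home.user1 fs1.example.com b fs2.example.com a
  # ... repeat (or script) for every volume listed on /vicepb ...
  # once it is empty, unmount it and restart the DAFS instance
  umount /vicepb
  bos restart fs1.example.com dafs -localauth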
Re: [OpenAFS] improving cache partition performance
On 2011-08-29 at 19:39, Jason Edgecombe ( ja...@rampaginggeek.com ) said: I was told that noatime is bad for an AFS cache partition because AFS uses the atime to know when the cache entry was last accessed.

Oops, looks like you're right, unless someone more knowledgeable says otherwise. --andy
Re: [OpenAFS] improving cache partition performance
Just to add some more datapoints to the discussion... Our webservers are HP DL360 G5s, 14GB RAM, pair of 36GB 15K 2.5" SAS drives in RAID-1. /usr/vice/cache is ext2 with noatime,nodiratime. These machines run dovecot IMAP, apache with lots of php applications, RT, and vsftpd serving anon and private ftp accounts. Serving content that isn't in the cache yet, we can get about 70-80MB/s depending on which fileserver it's coming from, and after it's cached, the gigabit network becomes the bottleneck. The cache partition is ~34GB in size, and we're running with these options:

  -dynroot -dynroot-sparse -fakestat -afsdb -nosettime -daemons 20 -stat 48000 -volumes 2048 -chunksize 19 -rxpck 2048

With those cache manager settings, cache partition utilization is sitting at about 92%. I can get even better numbers with memcache, and indeed most of our other machines are running with 2GB of memcache. I like seeing read performance in GB/s, and when most of your machines have 32GB or more (we have 3 with 256GB), a couple GB here and there won't have a noticeable impact.

Jason: do you know in particular what kind of workload is causing issues for you? You mentioned your wait times are on the order of seconds; are you sure that's caused by the underlying disk? At the very least, I would try mounting your cache partition as ext2, as has already been suggested. Turning off atime and diratime shouldn't hurt, and if your disks are having issues with seeks, this should help some.

Also, you really want to run 1.6.0pre7, or 1.6.0 when it shows up. Nothing wrong with 1.4, but if you're trying to get the most performance out of AFS on modern hardware, switching to 1.6 gives you some really cheap gains. There are huge performance improvements on Linux going from 1.4 to 1.6, and all of my new installations are 1.6.0pre7 for that reason. Especially with disk-based caches, as Simon mentioned. 1.6.0pre7 gets write performance for disk caches almost on par with memcache, though read performance is still lacking, as memory will almost always be faster than disk, but disk will always be 'cheaper' than memory. Worth a try at least, and pre7 has been very stable in our environment. --andy
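If it helps, switching the cache partition over is only a few commands; the device name below is a placeholder, and note the atime caveat raised elsewhere in this thread before adding noatime:

  # stop the client, recreate the cache filesystem as ext2, restart
  service openafs-client stop
  umount /usr/vice/cache
  mkfs.ext2 /dev/sda3            # wipes the cache; afsd rebuilds it on startup
  mount -t ext2 /dev/sda3 /usr/vice/cache
  service openafs-client start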
Re: [OpenAFS] Re: Solaris 10 deadlock issue
On 2011-06-17 at 12:07, Andrew Deason ( adea...@sinenomine.net ) said: On Fri, 17 Jun 2011 13:01:33 -0400 (EDT) Benjamin Kaduk wrote: This issue sounds rather similar (superficially, at least) to one we've been seeing on FreeBSD clients. When you say that "something has changed ...", is that something you think is OS-specific AFS code, OS code, or generic AFS code? Something has changed in the Solaris kernel, since this problem does not occur with earlier versions of Sol10 u8.

Can someone summarise which kernel versions / solaris updates and openafs versions are affected? Is there any combination of the openafs client and u9 that works right now? --andy
Re: [OpenAFS] OpenAFS 1.6pre5 Ubuntu ppa
On 2011-06-14 at 14:00, Nicolas Bourbaki ( ncl.bourb...@gmail.com ) said: Hi guys, I'd like to know if the Ubuntu ppa repository has been upgraded to offer the latest 1.6pre6 version of OpenAFS. I'm having odd behavior when using the latest version available on the following ppa: - http://ppa.launchpad.net/openafs/master/ubuntu/pool/main/o/openafs/ - openafs-client_1.6.0~pre5-2~ppa0~maverick1_i386.deb When opening my desktop session (user dirs on AFS), I have the following:

afs: Lost contact with file server XXX.XXX.XXX.XXX in cell yyy.yyy
afs: Lost contact with file server XXX.XXX.XXX.XXX in cell yyy.yyy
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: file server XXX.XXX.XXX.XXX in cell yyy.yyy is back up
afs: file server XXX.XXX.XXX.XXX in cell yyy.yyy is back up

In case it's related, one of our users on debian wheezy was experiencing the same symptoms while trying to 'hg pull'. The problem didn't show up until the machine had been up for about 2 weeks, and it went away after a reboot. I was going to wait and see if it shows up again before reporting it. openafs-client 1.6.0~pre5-2, kernel 2.6.38-2-amd64. The fstrace is here: /afs/bx.psu.edu/user/phalenor/public/fstrace_dump.txt If this isn't related to the problem above (not sure how close debian and ubuntu are wrt openafs), I'll send a real bug report if/when my problem shows up again. But in case it is related... --andy
Re: [OpenAFS] Re: Expected performance
On 2011-05-20 at 14:10, Andrew Deason ( adea...@sinenomine.net ) said: On Fri, 20 May 2011 13:51:05 -0400 (EDT) Andy Cobaugh wrote: ...or it's that you're writing to the same disk twice as much. If the cache and /vicepX are on the same disk, it seems pretty intuitive that it's going to be slower. It was with memcache. Well _I_ wasn't talking about memcache. :)

Of course. When I'm talking about performance, I'm almost never talking about disk cache ;)

mal and badger are slightly different hardware, but the tests above show that we get very similar performance between all server and client combinations except the case where client and server are on the same machine. Maybe I'm missing something here? I think to some extent it can still be that they're just using the same hardware resources, so some performance loss is to be expected (if the network wasn't the bottleneck for separate machines). I'm not sure if that can explain that degree of difference, though. I believe Rx in the past has had some odd behavior at really low RTT, but any known fixes there should have been in 1.6 for a while. I expect a similar thing happens on 1.4? Though of course the baseline performance is probably different, so it's not really comparing the same thing.

I think there were differences with 1.4, but it's been a while since that particular machine ran 1.4, so I don't remember exactly. One would think that a modern 8-core box with 32GB of memory would provide enough 'isolation' between server and client. Maybe I just don't know enough about what resources are involved in that case. I'm still curious what's actually causing that much of a performance loss. --andy
Re: [OpenAFS] Re: Expected performance
On 2011-05-19 at 14:19, Andrew Deason ( adea...@sinenomine.net ) said: On Thu, 19 May 2011 14:57:16 -0400 (EDT) Andy Cobaugh wrote: You can certainly get close if your disk for the disk cache is fast enough. I've seen close to 80MB/s with 15K SAS under ideal conditions. Re: client and server on the same machine - I've seen that actually result in lower performance. When you take the physical network out of the mix, Rx starts limiting you as a function of CPU usage, it seems. ...or it's that you're writing to the same disk twice as much. If the cache and /vicepX are on the same disk, it seems pretty intuitive that it's going to be slower.

It was with memcache. Just ran some quick tests yesterday to confirm what I saw before. Here I have two different clients, 'mal' and 'badger'. badger has a fileserver. There is another fileserver, fs8, which serves the purpose of showing maximum client performance (fs8 is our biggest and fastest fileserver currently). Clients on both mal and badger have essentially the same config, using a 655360-block memcache. iozone was used in all tests.

client -> server
mal -> fs8: http://www.bx.psu.edu/~phalenor/afs_performance_results/mal.bx.psu.edu-201105121302/
mal -> badger: http://www.bx.psu.edu/~phalenor/afs_performance_results/mal.bx.psu.edu-201105191536/
badger -> fs8: http://www.bx.psu.edu/~phalenor/afs_performance_results/badger.bx.psu.edu-201105191708/
badger -> badger: http://www.bx.psu.edu/~phalenor/afs_performance_results/badger.bx.psu.edu-201105191618/

mal and badger are slightly different hardware, but the tests above show that we get very similar performance between all server and client combinations except the case where client and server are on the same machine. Maybe I'm missing something here? --andy
Re: [OpenAFS] Re: Expected performance
On 2011-05-19 at 13:25, Andrew Deason ( adea...@sinenomine.net ) said: On Tue, 17 May 2011 23:14:03 +0100 Hugo Monteiro wrote: - Low performance and high discrepancy between test results. Transfer rates (only a few) hardly touched 30MB/s between the server and a client sitting on the same network, connected via GB ethernet. Most of the time the transfer rate is around 20MB/s, falling down to 13 or 14MB/s in some cases. The client and server configs would help. I'm not used to looking at single-client performance, but... assuming you're using a disk cache, keep in mind the data is written twice: once to the cache and once on the server. So, especially when you're running the client and server on the same machine, there's no way you're going to reach the theoretical 110MB/s of the disk. You can certainly get close if your disk for the disk cache is fast enough. I've seen close to 80MB/s with 15K SAS under ideal conditions. Re: client and server on the same machine - I've seen that actually result in lower performance. When you take the physical network out of the mix, Rx starts limiting you as a function of CPU usage, it seems. You may want to see what you get with memcache (or if you want to try a 1.6 client, cache bypass) and a higher chunksize. Just running dd on a box I have, running a 1.4 afsd with -memcache -chunksize 24 made it jump from the low 20s to high 40s/low 50s (MB/s), after starting with the defaults for a 100M disk cache.

Just to add some more data points... I recently saw peaks of 90MB/s for memcache for single-client writes. Reads from memcache can be as fast as your memory is, so upwards of a couple GB/s. In general, 1.6 memcache > 1.4 memcache > 1.6 diskcache > 1.4 diskcache. 1.6 disk cache uses a LOT less CPU than 1.4 disk cache, however - nice for processes that need IO and CPU at the same time on a machine that might already be lacking CPU. Options I used to get those numbers with 1.6.0pre5:

Client: -dynroot -fakestat -afsdb -nosettime -stat 48000 -daemons 12 -volumes 512 -memcache -blocks 655360 -chunksize 19
Server: -p 128 -busyat 600 -rxpck 4096 -s 1 -l 1200 -cb 100 -b 240 -vc 1200 -abortthreshold 0 -udpsize 1048576

The server in this case is a very new 16-core Opteron box with 32GB of RAM (it runs multiple fileserver instances under Solaris zones). The client is a relatively new 8-core Opteron box with 64GB of memory. Also in general, client performance seems to get worse the more CPUs you have. Our 48-core boxes tend to get lower numbers than our smaller 16- and 8-core boxes. I haven't done too many comparison tests to really quantify how much of a difference that makes, though. Cache bypass definitely makes things faster for things that aren't cached, though I will withhold performance numbers for that as I was testing bypass inside an ESX VM (one of our webservers); within the same machine, it got similar numbers to disk cache after the files had been cached (where the disk cache is a raw FC LUN).

Under normal conditions with fairly modern hardware, you should expect 50MB/s with some simple tuning (-chunksize mostly, and -memcache if your machine has the memory to spare). I haven't done any testing for the multi-client case, as that's slightly more difficult to properly test while holding everything else constant. By multi-client, I mean multiple actual cache managers involved as well as multiple users behind the same cache manager. --andy
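For anyone wanting to reproduce the kind of single-stream numbers quoted above, a simple dd test along these lines works; the AFS path and sizes are placeholders:

  # single-stream write into AFS, then read it back
  dd if=/dev/zero of=/afs/example.com/scratch/ddtest bs=1M count=4096 conv=fsync
  dd if=/afs/example.com/scratch/ddtest of=/dev/null bs=1M
  # (fs flushvolume or a client restart between runs gives cold-cache read numbers)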
Re: [OpenAFS] When to publish security advisories?
On 2011-04-15 at 16:46, Russ Allbery ( r...@stanford.edu ) said: Patricia O'Reilly writes: Is there any problem connecting 1.6 clients with 1.4.14 servers? Nope. Works fine. Overall, 1.6 clients seem to be working as well or better than 1.4 clients, although someone has reported reproducible hangs and crashes to me with 1.6 (and I've been trying to get him to file a bug report). But I don't know of anyone else having that trouble.

Any day now we'll be pushing 1.6.0pre4 out to all of our Linux clients. Some of them have been running various versions of 1.5 for many, many months now (web servers and the like). My testing shows 1.6 is noticeably faster than 1.4, with both disk cache and memcache. Most of our servers are still 1.4.x. As far as my site is concerned, we consider the 1.6.0pre4 client stable on Linux. --andy
Re: [OpenAFS] Re: Reporting on some recent benchmark results
On 2011-04-06 at 14:41, Andrew Deason ( adea...@sinenomine.net ) said: On Wed, 6 Apr 2011 15:29:58 -0400 (EDT) Andy Cobaugh wrote: No; I didn't even think that was ours to handle. So, if you stop the client, the AFS 'mount' entry stays there? I assume the multiple AFS lines are identical? What kernel? Lines are identical, as such: AFS on /afs/ type afs (rw) Somewhere you are specifying /afs/ as the AFS mountpoint, instead of /afs (such as in /usr/vice/etc/cacheinfo, or afsd args). If you change it to /afs, this appears to go away.

Yep, in cacheinfo. --andy
Re: [OpenAFS] Re: Reporting on some recent benchmark results
On 2011-04-06 at 14:25, Andrew Deason ( adea...@sinenomine.net ) said: On Wed, 6 Apr 2011 11:44:17 -0400 (EDT) Andy Cobaugh wrote: One observation regarding 1.6.0pre4: In stop'ing and start'ing the client via the init script, AFS shows up as being mounted several times in the output of 'mount' - is that to be expected? No; I didn't even think that was ours to handle. So, if you stop the client, the AFS 'mount' entry stays there? I assume the multiple AFS lines are identical? What kernel?

Lines are identical, as such: AFS on /afs/ type afs (rw) Running 2.6.18-194.32.1.el5. Interestingly, if I umount /afs/ after I shut down the client, I get "umount: /afs/: not mounted", but the mount entries go away one at a time with each umount invocation. --andy
Re: [OpenAFS] Reporting on some recent benchmark results
On 2011-04-06 at 16:06, Simon Wilkinson ( s...@inf.ed.ac.uk ) said: On 4 Apr 2011, at 22:18, Garrett Wollman wrote: Over the past few days I have performed several benchmarks comparing the performance of various OpenAFS server and client configurations. Thanks for this - it makes for really interesting reading. The statistic I'm really interested in at present, unfortunately, isn't one that you cover. With the imminent release of 1.6.0, what would be really interesting to know is a direct comparison between 1.4.14 and 1.6.0 on the same hardware, for the same workload. I know of workloads in which I can clearly show that 1.6.0 is faster; what would be really useful to see, and to understand, is workloads for which it is slower.

All of my iozone tests are here: http://www.bx.psu.edu/~phalenor/afs_performance_results/ For each run, you get the raw iozone output, and an 'info' file that collects information about the client: version, memory, afsd options, and location of the test volume. I'm only interested in single-client, single-thread performance - when your users are dealing with files 10s and 100s of GB in size, that's all you really care about. The recent tests on the 'c2' machine are my attempt to decide whether to deploy 1.6.0pre4 on all of our clients in place of 1.4.14. 1.6.0pre4 with memcache is looking very promising so far, easily capable of saturating a gigabit connection under the right conditions. Of course, none of our tests are with encryption turned on. In our experience, it's far too easy for just a few clients to bring down even some of our fastest fileservers when they're all on gigabit.

One observation regarding 1.6.0pre4: in stop'ing and start'ing the client via the init script, AFS shows up as being mounted several times in the output of 'mount' - is that to be expected? --andy
Re: [OpenAFS] openafs 1.6.0pre4 and OSX 10.6.7 and 64bit kernel (NOT really) FIXED ;(
On 2011-03-29 at 20:10, Chris Jones ( christopher.rob.jo...@cern.ch ) said: Hi,

Chris-Jones-Macbook-Pro /Library/OpenAFS/Tools/bin > ./cmdebug localhost
Lock afs_discon_lock status: (none_waiting, 1 read_locks(pid:1133))
** Cache entry @ 0xd35161a0 for 0.1.16777996.1 [dynroot]
   locks: (none_waiting, write_locked(pid:1133 at:599))
   18 bytes DV 1 refcnt 0
   callback expires 0
   0 opens 0 writers
   mount point states (0x5), stat'd, read-only

and a slightly different one later on (whilst waiting to just cd into a directory under /afs/cern.ch):

Chris-Jones-Macbook-Pro /Library/OpenAFS/Tools/bin > ./cmdebug localhost
Lock afs_discon_lock status: (none_waiting, 1 read_locks(pid:1156))
** Cache entry @ 0xd35184b0 for 382.537112396.26.32 [cern.ch]
   locks: (none_waiting, write_locked(pid:1156 at:66))
   7 bytes DV 1 refcnt 0
   callback 263a6708 expires 1301440202
   0 opens 0 writers
   normal file states (0x1), stat'd

fwiw, we started seeing this on Leopard as early as 1.5.77. I just now saw this on 1.6.0pre4 on Snow Leopard with the 32-bit kernel. It's also happened with 1.6.0pre2 on Leopard. Sometimes it hangs for only a few minutes. Other times, it will hang for hours until someone reboots. cmdebug always reports 1 or more read locks on afs_discon_lock, with a random pid. --andy
Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
On 2011-03-07 at 11:27, Andrew Deason ( adea...@sinenomine.net ) said: On Fri, 4 Mar 2011 16:23:34 -0500 (EST) Andy Cobaugh wrote: Volume name in question is pub.m.rpmforge. The .backup volume in particular. This volume was backup'd this morning at approx. 0005, with this output from vos backup: Failed to end the transaction on the rw volume 536873153 : server not responding promptly Error in vos backup command. : server not responding promptly Does this happen often enough that you could tell me if a patch makes it go away? I'd like to know if this fixes it (it'll apply to 1.6.0pre2 with a little fuzz): <http://git.openafs.org/?p=openafs.git;a=commitdiff_plain;h=69077559a7fc5784445ed56a2bfd613a5bb4174b>

I'd like to wait for it to happen one more time before calling it a problem. I'll try that if it happens again. Given the frequency with which this has happened, I wouldn't be surprised if it happens again before Wednesday. --andy
Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
On 2011-03-07 at 11:03, Andrew Deason ( adea...@sinenomine.net ) said: On Fri, 4 Mar 2011 19:42:04 -0500 (EST) Andy Cobaugh wrote: Tue Mar 1 00:02:12 2011 VReadVolumeDiskHeader: Couldn't open header for volume 536871061 (errno 2) means the volume doesn't exist. It's not that it's corrupt or anything; the volume was completely deleted. (or something just deleted the .vol header, but the other messages suggest it was deleted normally) What does 'deleted normally' mean in this context? Nothing touched the volume since the previous night, when it created the .backup volume just fine. Unfortunately, those logs have since rolled over, so I don't have anything older than from when I restarted the fileserver at 16:12 on Mar 1. Deleted normally as in, a 'vos remove' or 'vos zap'. The volume header didn't exist, and we didn't encounter any extant files when recreating the clone, suggesting that the backup clone was cleanly deleted before we tried making a new one.

Nope, nothing like that, so it must have been deleted abnormally somehow. I'll keep an eye out for this next time.

Yes, when the volume got caught in that state, any access could have triggered a salvage (since it was in a half-created state). So, an examine or someone just trying to access 'yesterday' (or whatever you call it) could have caused that.

daily_backup_snapshot. That was probably it. --andy
Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
On 2011-03-04 at 16:30, Andrew Deason ( adea...@sinenomine.net ) said: On Fri, 4 Mar 2011 17:20:34 -0500 (EST) Andy Cobaugh wrote: The first issue you reported had problems much earlier before the log messages you gave. Did anything happen to the backup volume before that? No messages referencing that volume id? Did you or someone/thing else remove the backup clone or anything? Nope. We don't even access the backup volume when doing the file-level backups anymore. Well, _something_ deleted it, unless it didn't exist before 1 Mar 2011.

It certainly did exist before that, and nothing I did and no part of our backup system would have deleted it.

This message: Tue Mar 1 00:02:12 2011 VReadVolumeDiskHeader: Couldn't open header for volume 536871061 (errno 2) means the volume doesn't exist. It's not that it's corrupt or anything; the volume was completely deleted. (or something just deleted the .vol header, but the other messages suggest it was deleted normally)

What does 'deleted normally' mean in this context? Nothing touched the volume since the previous night, when it created the .backup volume just fine. Unfortunately, those logs have since rolled over, so I don't have anything older than from when I restarted the fileserver at 16:12 on Mar 1.

Yes, the zaps were me trying to get the .backup into a usable state. Though, the first string of salvages started in the middle of the afternoon without any intervention - I think the event that caused them is what's missing from the picture. Well, do you have the messages from around then?

Ugh, no. Hopefully I will if it happens again.

I'm still a little hesitant to bos salvage that server - the whole reason we're trying to switch to DAFS is to avoid the multi-hour fileserver outages. Salvaging a single volume is the same as a demand-salvage; it is no slower and no more impactful than an automatically-triggered one. But you can manually trigger the salvage of a single volume group in cases like this (e.g. when the fileserver refuses to because it's been salvaged too many times).

Ok, I had to bos salvage the .backup volume directly with -forceDAFS. When this happened on my machine at home, it wasn't so easy. In that case, it was with an RO clone. I think I had to remsite, then remove or zap or some combination, along with manually deleting the .vol. I wish I had paid closer attention then. I still have no idea what caused the volume to spontaneously need salvaging Tuesday afternoon. I did notice that until I fixed the BK volume, if I did a 'vos exam home.gsong.backup', that triggered a salvage. Wish I had more to go on. I'll be working on standardizing our logging configuration across servers next week, logging via syslog, etc., so we don't lose valuable logs like this. --andy
Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
On 2011-03-04 at 15:59, Andrew Deason ( adea...@sinenomine.net ) said: What about the command immediately preceding this? Anything odd about it; time it took to execute, or any warnings/errors/etc?

The commands before that all completed in 30 seconds or less. No messages other than that.

I'm not sure how related this is to the other issue I saw, where the backup clone was left in a much worse state. I don't think it is; that error above isn't even really much of a problem; we just failed to end the transaction, but the transaction is idle by that point and will be ended automatically after 5 minutes (as you see in the VolserLog). The first issue you reported had problems much earlier before the log messages you gave. Did anything happen to the backup volume before that? No messages referencing that volume id? Did you or someone/thing else remove the backup clone or anything?

Nope. We don't even access the backup volume when doing the file-level backups anymore.

The first messages around Tue Mar 1 00:02:12 2011 look like what would happen if you tried to recreate the BK after it was deleted with that code (fixed in the patches I mentioned before). The subsequent salvages are from an error reading some header data, which could be explained by the attempted 'zap's and such, assuming those messages were during/after you noticed the volume being inaccessible and tried forcefully deleting it.

Yes, the zaps were me trying to get the .backup into a usable state. Though, the first string of salvages started in the middle of the afternoon without any intervention - I think the event that caused them is what's missing from the picture. I'm still a little hesitant to bos salvage that server - the whole reason we're trying to switch to DAFS is to avoid the multi-hour fileserver outages. I'm going to take some time either later tonight or early next week to go back through the logs and try to make more sense of them from a chronological standpoint, and see if there's anything I missed. There's still a bug somewhere that causes a .backup volume to go off-line after being created. I have a test volume on one of the problem fileservers right now that's been vos backup'd once a minute since yesterday without a problem. So, something else must have to happen to cause this, just not sure what. --andy
Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
Ok, an update to the problem I alluded to this morning. Volume name in question is pub.m.rpmforge. The .backup volume in particular. This volume was backup'd this morning at approx. 0005, with this output from vos backup:

Failed to end the transaction on the rw volume 536873153 : server not responding promptly
Error in vos backup command. : server not responding promptly

That command returned in <5s. I then see this in VolserLog:

Fri Mar 4 00:05:15 2011 1 Volser: Clone: Recloning volume 536873153 to volume 536873155
Fri Mar 4 00:10:19 2011 trans 13950 on volume 536873153 has been idle for more than 300 seconds
Fri Mar 4 00:10:49 2011 trans 13950 on volume 536873153 has been idle for more than 330 seconds
Fri Mar 4 00:11:19 2011 trans 13950 on volume 536873153 has been idle for more than 360 seconds
Fri Mar 4 00:11:49 2011 trans 13950 on volume 536873153 has been idle for more than 390 seconds
Fri Mar 4 00:12:19 2011 trans 13950 on volume 536873153 has been idle for more than 420 seconds
Fri Mar 4 00:12:49 2011 trans 13950 on volume 536873153 has been idle for more than 450 seconds
Fri Mar 4 00:13:19 2011 trans 13950 on volume 536873153 has been idle for more than 480 seconds
Fri Mar 4 00:13:49 2011 trans 13950 on volume 536873153 has been idle for more than 510 seconds
Fri Mar 4 00:14:19 2011 trans 13950 on volume 536873153 has been idle for more than 540 seconds
Fri Mar 4 00:14:49 2011 trans 13950 on volume 536873153 has been idle for more than 570 seconds
Fri Mar 4 00:15:19 2011 trans 13950 on volume 536873153 has been idle for more than 600 seconds
Fri Mar 4 00:15:19 2011 trans 13950 on volume 536873153 has timed out

Nothing in any of the other log files, and nothing interesting in FileLog other than:

Mar 4 00:05:15 horvitz fileserver[2236]: VOffline: Volume 536873153 (pub.m.rpmforge) is now offline (A volume utility is running.)
Mar 4 00:05:15 horvitz fileserver[2236]: fssync: breaking all call backs for volume 536873155

(and then tsm goes to access the RW volume, at which point I guess it's brought back online)

Mar 4 01:00:31 horvitz fileserver[2236]: SAFS_FetchStatus, Fid = 536873153.1.1, Host 128.118.200.6:7001, Id 117
Mar 4 01:00:31 horvitz fileserver[2236]: VOnline: volume 536873153 (pub.m.rpmforge) attached and online

I noticed this when nagios reported that one of the volumes on this server was marked off-line. Now, interestingly, I just ran another vos backup against the same volume:

$ vos backup pub.m.rpmforge
Created backup volume for pub.m.rpmforge
Fri Mar 4 16:05:14 2011 1 Volser: Clone: Recloning volume 536873153 to volume 536873155

The pub.m.rpmforge.backup is now on-line. Subsequent backups seem to be fine. I'm not sure how related this is to the other issue I saw, where the backup clone was left in a much worse state. The immediate effects of the vos backup are the same, but I'm still not sure what caused the demand salvage of the volume later during the day in that case. In that case, that was a home directory that was very much in use, and something triggered the salvage; there's just nothing in the logs to indicate why. --andy
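Incidentally, a quick way to spot this condition by hand (or from a nagios plugin) is just to scan vos output; the server name below is a placeholder:

  # list anything the fileserver considers off-line
  vos listvol horvitz.example.com | grep -i off-line
  # and check the clone directly
  vos examine pub.m.rpmforge.backup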
Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
FYI, I have another .backup volume that started having the same issues this morning on a different machine, under 32-bit Linux with 1.6.0pre2. I'll gather some more details later today. This same fileserver had no issues running 1.5.77, 1.5.78, or 1.6.0pre1 (well, other than vos backup not working at all with certain versions). --andy
Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
On 2011-03-03 at 11:05, Andrew Deason ( adea...@sinenomine.net ) said: On Tue, 1 Mar 2011 22:23:34 -0600 Andrew Deason wrote: The problem with the recovery is (probably) that the salvager doesn't properly inform the fileserver when it destroys a volume, so the erroneous volume state prevents you from doing anything with the volume after it's destroyed. I need to test that behavior out tomorrow and see what happens. This is what happens, and can be easily seen if you corrupt the header for a clone, try to access it, and try to recreate it after the salvager deletes it. Gerrit 4117-4120 have been submitted to fix this.

Excellent. So I guess the remaining question is: how did the header get corrupted in the first place? I'll be sure to keep a closer eye on things next time I see this. I've seen this twice on two completely different systems (my home machine, and a production fileserver at work, both after upgrading to 1.6 [and I think they were both pre2]), so I'm sure I'll see it again; it's just a matter of time. --andy
Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
On 2011-03-01 at 22:23, Andrew Deason ( adea...@sinenomine.net ) said: On Tue, 1 Mar 2011 22:38:07 -0500 (EST) Andy Cobaugh wrote: (and I think you meant dafssync-debug. I may not have mentioned that.) fssync-debug should detect a DAFS fileserver and execute dafssync-debug for you.

If I just do fssync-debug, it tells me this:

*** server asserted demand attach extensions. fssync-debug not built to
*** recognize those extensions. please recompile fssync-debug if you need
*** to dump dafs extended state

Have you done successful 'vos backup's of that volume after the 1.6.0pre2 upgrade? Or did you upgrade and it broke?

Oh yes, definitely. It was upgraded on Feb 19.

Hmm, well, I interpreted "turned debugging up" to mean "up all the way", which actually probably isn't true. The messages I'm looking for are at level 125, and there's a lot of them (they log every FSSYNC request and response).

Yeah, only running at 5 right now.

If I look in FileLog.old (I restarted at some point to up the debug level), I see these lines: You can change that with SIGHUP/SIGTSTP (unless you're doing that for a permanent change).

Is that to increase/decrease the logging level, respectively?

Tue Mar 1 16:11:34 2011 FSYNC_com: read failed; dropping connection (cnt=94804)
Tue Mar 1 16:11:34 2011 FSYNC_com: read failed; dropping connection (cnt=94805)

There should be a SYNC_getCom right before these (though it probably just says "error receiving command"). Just to be sure, there aren't any processes dying/respawning in BosLog{,.old}, are there?

No processes dying, fortunately.

Failed to end the transaction on the rw volume 536871059 : server not responding promptly
Error in vos backup command. : server not responding promptly

That's RX_CALL_TIMEOUT, which I'm not used to seeing on volserver RPCs... Do you know how long it took to error out with that? If it takes a while, a core of the volserver/fileserver while it's hanging would be ideal. It might just be the fileserver trying to salvage the volume a bunch of times or something, though, and that takes too long.

From the start of the vos backup command until it returned was 16s, according to our logs. --andy
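For reference, the signal-based adjustment Andrew mentions would be something like the following; this is from memory rather than from the message above (I believe SIGTSTP raises the fileserver's LogLevel one step and SIGHUP resets it), and the process name assumes DAFS:

  # raise the log level one step
  kill -TSTP $(pgrep dafileserver)
  # reset it back to the default
  kill -HUP $(pgrep dafileserver)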
Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug
On 2011-03-01 at 20:27, Andrew Deason ( adea...@sinenomine.net ) said: On Tue, 1 Mar 2011 18:57:50 -0500 (EST) Andy Cobaugh wrote: This has happened at least once at work on Solaris 10 x86 with a .backup volume, as seen above, and at least once on one of my home machines on 64bit linux with an RO clone. The volume was probably deleted during the salvage (it was already gone by the time of the 'zap -force'), but the fileserver still has the volume in an 'error' state. Could you: volinfo /vicepa 536871061 ; fssync-debug query 536871061

# volinfo /vicepcb 536871061
Inode 2305843649298038783: Good magic 78a1b2c5 and version 1
Inode 2305843649365147647: Good magic 99776655 and version 1
Inode 2305843649432256511: Good magic 88664433 and version 1
Inode 2305843641043648511: Good magic 99877712 and version 1
Volume header for volume 536871061 (home.gsong.backup)
stamp.magic = 78a1b2c5, stamp.version = 1
inUse = 0, inService = 0, blessed = 1, needsSalvaged = 0, dontSalvage = 0
type = 2 (backup), uniquifier = 6743359, needsCallback = 0, destroyMe = d3
id = 536871061, parentId = 536871059, cloneId = 0, backupId = 536871061, restoredFromId = 0
maxquota = 134217728, minquota = 0, maxfiles = 0, filecount = 221896, diskused = 56684296
creationDate = 1299021740 (2011/03/01.18:22:20), copyDate = 1299021740 (2011/03/01.18:22:20)
backupDate = 1299021740 (2011/03/01.18:22:20), expirationDate = 0 (1969/12/31.19:00:00)
accessDate = 1299021734 (2011/03/01.18:22:14), updateDate = 1299021636 (2011/03/01.18:20:36)
owner = 1045, accountNumber = 0
dayUse = 0; week = (0, 0, 0, 0, 0, 0, 0), dayUseDate = 0 (1969/12/31.19:00:00)
volUpdateCounter = 135816

(and I think you meant dafssync-debug. I may not have mentioned that.)

# dafssync-debug query 536871061
calling FSYNC_VolOp with command code 65543 (FSYNC_VOL_QUERY)
FSSYNC service returned 0 (SYNC_OK)
protocol header response code was 0 (SYNC_OK)
protocol reason code was 0 (SYNC_REASON_NONE)
volume = {
    hashid          = 536871061
    header          = 0
    device          = 79
    partition       = 102a75a8
    linkHandle      = 0
    nextVnodeUnique = 0
    diskDataHandle  = 0
    vnodeHashOffset = 79
    shuttingDown    = 0
    goingOffline    = 0
    cacheCheck      = 0
    nUsers          = 0
    needsPutBack    = 0
    specialStatus   = 0
    updateTime      = 0
    vnodeIndex[vSmall] = { handle = 0 bitmap = 0 bitmapSize = 0 bitmapOffset = 0 }
    vnodeIndex[vLarge] = { handle = 0 bitmap = 0 bitmapSize = 0 bitmapOffset = 0 }
    updateTime      = 0
    attach_state    = VOL_STATE_ERROR
    attach_flags    = VOL_IN_HASH | VOL_ON_VBYP_LIST
    nWaiters        = 0
    chainCacheCheck = 3
    salvage = { prio = 0 reason = 0 requested = 0 scheduled = 0 }
    stats = {
        hash_lookups = { hi = 0 lo = 155 }
        hash_short_circuits = { hi = 0 lo = 0 }
        hdr_loads = { hi = 0 lo = 0 }
        hdr_gets = { hi = 0 lo = 0 }
        attaches         = 0
        soft_detaches    = 0
        salvages         = 16
        vol_ops          = 1
        last_attach      = 0
        last_get         = 0
        last_promote     = 0
        last_hdr_get     = 0
        last_hdr_load    = 0
        last_salvage     = 1299019004
        last_salvage_req = 1299018855
        last_vol_op      = 1299018890
    }
    vlru = { idx = 5 (VLRU_QUEUE_INVALID) }
    pending_vol_op  = 0
}

Do you want the .vol file for this volume? on the fileserver? I have an idea on why you can't get the volume usable again, but I have no clue as to what the original inconsistency was that caused the first salvage. My suspicion is that a previous 'vos backup' left it in this state.

The volume group hasn't been touched other than for backups in many months. I've never had a problem like this with that fileserver or volume until I upgraded from 1.4.11 to 1.6.0pre2.
I would have included more snippets from FileLog as well, but I have the debug level turned up to try to track down a possible ... Then you should have some logs mentioning 'FSYNC_com' around 'Tue Mar 1 00:02:25 2011' explaining why we refused to give out the volume. (You don't perhaps
[OpenAFS] 1.6.0pre2 - more vos issues, possible bug
I have comments interspersed with log file snippets in a plain text file here: http://users.bx.psu.edu/~phalenor/problem I'm not sure what led to the initial problems with the .backup volume. We vos backup every volume every night. This has happened at least once at work on Solaris 10 x86 with a .backup volume, as seen above, and at least once on one of my home machines on 64bit linux with an RO clone. I would have included more snippets from FileLog as well, but I have the debug level turned up to try to track down a possible authentication bug (where tokens no longer work against a 1.6.0pre2 fileserver, but are fine against other 1.4 fileservers - more on that after I've gathered more evidence). --andy
Re: [OpenAFS] Revival: Recommended way to start up OpenAFS on Solaris 10?
On 2011-02-21 at 16:36, Jeff Blaine ( jbla...@kickflop.net ) said: Best I can tell, the thread ended with this message from David Boyes @ SNA: http://www.openafs.org/pipermail/openafs-info/2010-January/032816.html Anything? Anyone? Did we get anywhere? Just looking to snarf someone's SMF stuff that works.

https://github.com/phalenor/openafs-smf

I have a feeling those are less than correct, but it's a start. I've had issues with the server manifest a few times, when it comes to shutting down or restarting. Feel free to push any changes you end up making. --andy
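For anyone grabbing those manifests, importing and enabling them on Solaris 10 generally looks like the following; the manifest filename and FMRI are guesses based on the repository layout, not something verified here:

  svccfg import openafs-client.xml
  svcadm enable svc:/network/openafs/client:default
  svcs -x '*openafs*'      # if it doesn't come online, this shows why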
Re: [OpenAFS] pam_afs_session in Fedora?
On 2011-02-18 at 12:33, Ken Dreyer ( ktdre...@ktdreyer.com ) said: On Fri, Feb 18, 2011 at 12:19 PM, Brandon S Allbery KF8NH wrote: On 2/18/11 14:14 , Andy Cobaugh wrote: Just curious why you're not just using the stock pam_krb5? At least in a plain jane krb5 environment, pam_krb5 has worked fine for us (though I haven't tried very recent Fedora). There are programs which don't do PAM right; in particular, they run pam_krb5 in root's context instead of the user's context, which worst-case results in a UID-based (no PAG) root token and no user token. This works fine with krb5 if they do it right, but the token is a side effect that can't be corrected in the session module. Right, I want PAG support and the other benefits of pam_afs_session. RedHat's pam_krb5's AFS support is not very good. In addition to not granting PAGs, I've seen situations where it will check if AFS is running, and if so, it attempts to convert the user's Kerberos 5 credential to a Kerberos 4 credential. This will time out because it cannot find the Kerberos 4 KDCs (none exist). Logins were taking a minute or more in these cases. Setting "ignore_afs" solved the problem.

I can log in with pam_krb5, and I get put in a keyring-based PAG. I do see that the krb4_* options are no longer available in f14. In any event, I would definitely welcome pam_afs_session in EPEL; at least our PAM configurations would be somewhat similar across platforms. --andy
Re: [OpenAFS] pam_afs_session in Fedora?
On 2011-02-18 at 11:16, Ken Dreyer ( ktdre...@ktdreyer.com ) said: I would like to try to get Russ's pam_afs_session into Fedora/EPEL. Since OpenAFS itself is not permitted for inclusion (I think it's because "no kernel modules"?), I'm hoping that there will still be utility to at least having pam_afs_session available. It won't be built with openafs-devel, but I don't think that's a problem, right? I've tested building in mock without depending on AFS at all, and it seems to work.

Just curious why you're not just using the stock pam_krb5? At least in a plain jane krb5 environment, pam_krb5 has worked fine for us (though I haven't tried very recent Fedora). --andy
[OpenAFS] Re: [OpenAFS-announce] OpenAFS 1.6.0 release candidate 2 available
So far so good. Deployed 1.6.0pre2 on 64 and 32 bit CentOS. Clients with disk cache and memcache, as well as two DAFS fileservers. We've been running a mixture of 1.5.77, 1.5.78, and 1.6.0pre1 for some time now, on OSX, Solaris SPARC and x86, and 32/64-bit Linux, as both DAFS fileservers and clients with only a few bugs here and there, all of which seem to have been fixed with 1.6.0pre2 (mostly around vos). Ran some quick iozone tests on 1.6.0pre2 with client and server on the same machine over loopback on 64-bit CentOS. There don't appear to be any gross single-client performance regressions in that case. --andy
Re: [OpenAFS] calculating memory
On 2011-01-28 at 22:38, Gary Gatling ( gsgat...@eos.ncsu.edu ) said: I am going to use RHEL 6 for the fileserver. I have a test VM up and working with openafs 1.4.14 to start with. Seems to work ok with ext4. The version of VMware we are using is VMware ESX. We pay full price for that. I think we are slowly moving to version 4, but right now I think it's mostly 3. (We can use the vmxnet 2 NIC but not 3 on most boxes so far.)

Sounds similar to what we do, except switch RHEL for Solaris 10. You definitely want to use the latest vmxnet drivers you have. This speeds the network up tremendously, or at the very least reduces CPU overhead. I think on CentOS 5.5 64bit I couldn't get much more than ~300Mbps with the virtualized NIC, but can get at least 800Mbps with vmxnet3. Similar results under Solaris.

The only reason we use Solaris is for compression. With LZJB, we see almost 2:1 compression on our home directories, which are currently using about 4TB+ of our SAN storage, which really means we have closer to 6TB of actual home directories. LZJB uses hardly any CPU, and I'm sure in some cases it's faster to compress than to write to disk. Oh, and end-to-end checksums are a nice bonus too if you don't trust your underlying storage, even if it is fancy uber-expensive SAN storage (we don't do ZFS RAID, just zpools with a single vdev -> RAID5 LUN). We currently run 3 such fileserver VMs on VMware ESXi 4.x on the same box, 2 vCPUs each (the fileserver will barely use 2 CPUs, so factor in that plus a CPU for the volserver when doing vos moves). Each of those VMs has 2GB of memory assigned to it right now, and that seems to be enough even with ZFS in play. If I'm reading the output from ps correctly, one of our larger DAFS fileservers running on CentOS 5.5 64bit is using 1.8GB, davolserver 1.5GB. (That's with -p 128 to both commands, so actual memory usage is probably much smaller than that.)

It seems like on Solaris 10 with openafs 1.4.11 the server uses about 1 GB when it's not backing up. I am not sure how much it uses at "peak times" or when doing full backups. And I don't have the new backup software (yet). Teradactyl is the backup software we are switching to, to ditch Solaris for Linux.

Just to add another datapoint to the mix, we use TSM (provided by our university's central IT), and just do file-level backups. At least that way we're server agnostic (though it's not the fastest solution by a long shot - the TSM server is the bottleneck in our case, so there wasn't any point in choosing a faster backup strategy). I'm curious - how are you backing up AFS now?

I gather real servers aren't an option 'cause management really likes moving most everything into VMware. We already moved all our license and web servers into VMware and we have some other weird servers working in it also. Even Windows infrastructure like domain controllers and stuff. If everyone says it's a bad idea I can make an argument though. :)

Eh, if you push your data onto these virtualized servers and performance takes a hit (we'll sometimes see sporadic slowdowns when vos moves are happening on the same ESX host), then obviously you can try to take the "I told you so" approach and get some bare metal hardware to compare things to. Oh, and we also do raw device maps in ESX. I haven't quantified how much faster raw device maps are than going through a virtual disk on VMFS on the SAN, but being able to access that LUN from a non-ESX box and see ext4 instead of VMFS sounds like the makings of a good DR strategy.
One more thing: SAN raw device maps in ESX 4 are limited to 2TB. I guess the hypervisor is still using Linux 2.4, and there are some limitations from Linux itself in play there. You can create a VMFS bigger than 2TB by using multiple extents (I think). iSCSI doesn't have this limitation. Just something to be aware of.

I would be very curious to see any benchmarks you come up with. Things like iozone on the vicep itself, iperf between VMs on the same vSwitch, between VMs on different hosts, etc. --andy
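A minimal version of the benchmarks being asked for might look like this; hostnames and paths are placeholders:

  # raw network throughput between two VMs (start the server side on vm1 first)
  iperf -s                          # on vm1
  iperf -c vm1.example.com -t 30    # on vm2
  # local filesystem performance on the vice partition itself
  iozone -a -g 4G -f /vicepa/iozone.tmp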
Re: [OpenAFS] PTS membership (or existence) based on external data?
On 2011-01-21 at 11:36, Stephen Joyce ( step...@physics.unc.edu ) said: Hello, Has anyone written a script or utility to add/remove PTS entries (either membership in PTS groups or actual existence of the PTS user account would be acceptable) from an external database, based on date? My AFS cell is in the middle of transitioning from authenticating against a departmental KRB5 realm to authenticating against a central University-wide KRB5 realm. I'd like to be able to continue to have the ability to expire students' access to resources automatically--when their affiliation with the Department expires: at the end of a semester, research project, etc. So I thought I'd ask if anyone has an in-house tool (querying expiration dates from an external source such as a non-authoritative KDC, SQL, etc.) and is willing to share, before I possibly reinvent the wheel.

This is what we use: https://github.com/phalenor/ldap2pts

It's not perfect, is very specific to our site, has at least one bug that needs to be fixed (the owner of user:group groups needs to match the username), screen scrapes all of the pts commands, is an example of some non-ideal Perl programming, and won't scale too well. We run it once every 10 minutes, but we only have 259 accounts and 92 groups, so it may only take on the order of 30 seconds to run (on a SunFire V100). I wanted to add support for parsing the output of an openldap accesslog so it syncs in almost real-time and doesn't have to compare all of ldap against all of pts. Anyway, it might give you some different ideas. --andy
Re: [OpenAFS] k5start, AFS and long-running daemon
On 2011-01-17 at 16:17, Stephen Quinney ( step...@jadevine.org.uk ) said: I am having some problems with trying to use k5start to maintain a kerberos credential cache for a long-running daemon. In particular, it's maintaining the AFS tokens which is problematic. I noticed on http://www.eyrie.org/~eagle/software/kstart/todo.html, the following comment on the k5start todo list: "Add a flag saying to start a command in a PAG and with tokens and then keep running even if the command exits. This would be useful to spawn a long-running daemon inside a PAG and then maintain its tokens, even if k5start and the daemon then become detached and have to be stopped separately." I have a daemon which detaches but which needs to access AFS directories. Running k5start in the background works great for maintaining the kerberos cache (which is also needed for DB access); it's just AFS which is causing problems. So this sounds like exactly what I need to do. Given that this isn't currently possible with k5start, can you suggest the best way to go about achieving the same thing?

Just start the whole thing inside pagsh. Then we use these options to k5start:

/usr/bin/k5start -b -K 10 -l 14d -p /var/run/$prog-k5start.pid -f $keytab -k $ccname -t $princ2

where $keytab is obvious, $ccname = /tmp/krb5cc_k5start_wrapped-$prog, and $princ2 = -U or $princ@$realm (depending on the k5start version). That's taken almost directly from our k5start-wrapper script, which we use to wrap init scripts under /etc/init.d/. You create /etc/init.d/$prog-afs, set a couple of variables like $keytab, then source k5start-wrapper. --andy
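A stripped-down sketch of that sort of wrapper, for anyone who wants the shape of it - the daemon name, keytab path, and principal below are made up, and this is not the actual k5start-wrapper script referenced above:

  #!/bin/sh
  # hypothetical /etc/init.d/mydaemon-afs
  prog=mydaemon
  keytab=/etc/keytabs/$prog.keytab
  ccname=/tmp/krb5cc_k5start_wrapped-$prog
  princ=service/$prog

  case "$1" in
    start)
      # run k5start and the daemon inside the same PAG, so the token k5start
      # refreshes via aklog (-t) is the one the daemon actually uses
      pagsh -c "/usr/bin/k5start -b -t -K 10 -l 14d \
                    -p /var/run/$prog-k5start.pid -f $keytab -k $ccname $princ; \
                KRB5CCNAME=$ccname /usr/sbin/$prog"
      ;;
    stop)
      kill $(cat /var/run/$prog-k5start.pid)   # stopping the daemon itself is left out
      ;;
  esac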
Re: [OpenAFS] volume size
On 2011-01-14 at 11:43, Lewis, Dave ( le...@nki.rfmh.org ) said: Hi, I'm wondering what is a reasonable size for large AFS volumes. I understand that the maximum size of a volume is about 2 TB (assuming that the partition is at least that size). From a practical standpoint, is it reasonable to have a 2 TB volume? Should I expect any problems doing operations like bos salvage or vos move on large volumes?

We've been running with some data volumes in the TB range for a while now without problems. Our biggest volume right now is ~3.2TB. Splitting these large volumes isn't very practical. vos move will seem to take forever, but we've moved TB-scale volumes without any problems. You'll find, however, that around 2TB some tools will start to report negative numbers for volume size, and you can't set a quota bigger than 2TB, so you get to set the quota to 0, disabling it entirely.

For example, I'm wondering if bos salvage has a "harder" time with a few large volumes than with several smaller volumes. I figure that, with smaller volumes, internal inconsistencies that bos salvage fixes would be more isolated than with large volumes, and that that would be beneficial. But I don't really know.

Salvages will certainly take longer, but they haven't caused any problems in my experience.

Currently we mount 25 GB volumes in users' home directories for their image data, which grows a lot during data processing. Some users are starting to feel limited by 25 GB volumes, so I'm considering going to 100 GB volumes. I would appreciate any advice. Should large volumes be salvaged more often than small volumes?

Gee, how big are their homedirs then ;) Ours start out with a 50GB quota, with some hovering somewhere between 100-200GB. Our rule for allocating space is: don't give them too much right away, or they'll use it up in no time. When most people hit their quota, they naturally find ways to work within their quota without having to ask for more space ;) --andy
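For what it's worth, disabling the quota as described above is a one-liner either way (the volume name and path here are placeholders):

  # 0 turns quota enforcement off for the volume; values are otherwise in KB
  fs setquota /afs/example.com/data/bigvol -max 0
  # or, without needing a mount point handy
  vos setfields data.bigvol -maxquota 0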
[OpenAFS] Re: [OpenAFS-announce] OpenAFS 1.6.0 release candidate 1 available
On 2011-01-06 at 16:55, Derrick J Brashear ( sha...@openafs.org ) said: Please assist the gatekeepers by deploying this release and providing positive or negative feedback. Bug reports should be filed to openafs-b...@openafs.org . Reports of success should be sent to openafs-info@openafs.org .

This needs to get applied to 1.6: http://git.openafs.org/?p=openafs.git;a=commit;h=97474963e58253f8c891e9f6596403213d53527b --andy
Re: [OpenAFS] Package Management in AFS
On 2010-12-20 at 19:34, Dirk Heinrichs ( dirk.heinri...@altum.de ) said: On 20.12.2010 19:26, Booker Bense wrote: My 2 cents... Outside of a few very specialized apps, putting software in AFS is a losing proposition these days. Since local disk space is growing so fast, there really is little justification for not simply using the package management system of the OS and simply installing locally. Can't agree more.

We use stow to install certain pieces of software into AFS, usually one-off and standalone scientific software (we're in bioinformatics). For everything else, we use the package manager. RPMs really are easy to make - perhaps even easier than installing the same app in AFS. Even if there were something like rpm for afs, that would only make the two methods (installing on local disk or installing in afs) equivalent (ignoring any issues of permission). This also assumes you're running the same version of the same OS everywhere (for example, we use @sys symlinks, but in our environment amd64_linux26 isn't the same everywhere). Follow the principle of least work: is it more work to install an app into AFS, or to yum/apt-get/etc install it?

That would again mean that the sw had to be installed over and over again, on every single machine. That may be OK for 2 or 5 machines, but for a larger number this becomes a tedious task. And what about diskless clients?

That's what cfengine or puppet are for. IMO, any time you have to manage 2 or more machines, you really do need something like cfengine to do complete configuration. If you can't blow away entire machines and have them automatically reinstall and converge back to their previous state, then you're really not managing your systems. --andy
Re: [OpenAFS] missing /etc/sysconfig/openafs-client
On 2010-10-18 at 13:15, David Bear ( david.b...@asu.edu ) said: Indeed there is a /usr/vice/etc/cacheinfo. What concerns me is that this set of rpm's has different configuration files than the 1.4.10 rpms that I used to use. The /etc/init.d/openafs-client file sources /etc/sysconfig/openafs instead of /etc/sysconfig/openafs-client. Really? I just pulled down the openafs-client 1.4.10 RPM for RHEL-5 from dl.openafs.org, and the /etc/init.d/openafs-client that's included sources /etc/sysconfig/openafs, and uses the AFSD_ARGS variable for options to afsd (things like -memcache, -daemons, -dynroot, -afsdb, etc etc). In fact, if we look at the git history for src/packaging/RedHat/openafs-client.init, it seems like it has always sourced /etc/sysconfig/openafs, which is installed by the openafs package proper. Was the 1.4.10 you were running before downloaded from openafs.org originally? --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
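For reference, /etc/sysconfig/openafs is just a small shell fragment that the init script sources, and AFSD_ARGS is usually the only thing anyone touches in it. A rough example (the particular options shown are only illustrative, not a recommendation):

  # /etc/sysconfig/openafs -- sourced by /etc/init.d/openafs-client
  AFSD_ARGS="-dynroot -afsdb -daemons 6 -memcache"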
Re: [OpenAFS] forcing and update with yum for openafs
On 2010-10-14 at 11:08, David Bear ( david.b...@asu.edu ) said: I removed openafs 1.4.10 -- and installed the 1.4.12 repository rpm. However, when I try to do a yum install openafs it still wants to grab the 1.4.10 version. I look in /etc/yum.repos.d and see the openafs.repo file there that points to 1.4.12. But I am at a loss as to how to force yum to use it. Could it be caching something somewhere? Try a 'yum clean all' ? What happens if you say 'update' instead of 'install' ? It could be that you didn't remove every openafs package, and so a lingering package from 1.4.10 is specifying openafs = 1.4.10 as a dependency. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
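Roughly the sequence I'd go through in that situation (the package glob is just illustrative):

  # make sure yum isn't working from stale metadata
  yum clean all
  # see exactly which openafs packages are still installed, and at what version
  rpm -qa | grep -i openafs
  # then update everything openafs-related in one shot
  yum update 'openafs*'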
Re: [OpenAFS] Openafs Client with pam krb5 and ldap
On 2010-10-01 at 21:50, Russ Allbery ( r...@stanford.edu ) said: Russ Allbery writes: Oh, I understand now. pam_unix fails, and you were expecting pam_krb5 to return success (blindly) to counter pam_unix's failure, but since pam_krb5 (correctly) returns PAM_IGNORE for users about which it has no information, logins are failing because of the pam_unix failure. Or, if you remove pam_unix, because all modules in the stack returned PAM_IGNORE. Oh, and the other piece I forgot to mention: you saw this start happening in lenny because in etch pam-krb5 did blindly return PAM_SUCCESS if the user didn't log in with a password. This was changed in 3.11: pam_setcred, pam_open_session, and pam_acct_mgmt now return PAM_IGNORE for ignored users or non-Kerberos logins rather than PAM_SUCCESS. This return code tells the PAM library to continue as if the module were not present in the configuration and allows sufficient to be meaningful for pam-krb5 in account and session groups. Yeah, I think I remember reading that. On redhat, account uses pam_unix, pam_krb5, then pam_permit after running authconfig and telling it to use ldap and /etc/passwd for authZ, and krb5 and /etc/shadow for authN, so I think pam_permit may be the right way to go. Thanks for clearing this up. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
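For anyone following along, the account stack that authconfig leaves behind in that configuration looks more or less like this. This is reconstructed from memory, so treat it as a sketch and double-check against what your authconfig version actually writes to system-auth:

  account     required      pam_unix.so
  account     [default=bad success=ok user_unknown=ignore] pam_krb5.so
  account     required      pam_permit.so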
Re: [OpenAFS] Openafs Client with pam krb5 and ldap
On 2010-10-01 at 20:30, Russ Allbery ( r...@stanford.edu ) said: pam_permit of course fixes it because it basically disables the entire account stack. Just deleting everything out of the account stack would presumably also fix it. The account stack needs /something/ in it or it fails completely. I wonder if pam_krb5 is a red herring here and what's actually failing is pam_unix. Do the accounts you're trying to log in as exist in /etc/shadow? Does it work if you remove pam_krb5 and only keep pam_unix? pam_unix does require all accounts be present in /etc/shadow. These accounts exist through ldap, so no entries in /etc/shadow. It fails in the same manner with just pam_krb5. pam_krb5 and pam_permit together work. Is your pam_krb5 returning nothing for pam_sm_acct_mgmt with gssapi ssh logins perhaps? --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Openafs Client with pam krb5 and ldap
On 2010-10-01 at 19:03, Russ Allbery ( r...@stanford.edu ) said: account [default=ignore ignore=ignore success=ok] pam_krb5.so debug That doesn't look like anything that would ever be generated by default and it isn't in the docs. I wonder if that's causing your problem. PAM stacks can sometimes do really strange things if you set ignore as the action and it's the last module in the stack. Well, I didn't put it there, so something did. I've seen that line on systems that were originally lenny, and systems that were upgraded to lenny. account required pam_unix.so account required pam_krb5.so Yep, putting exactly those lines in common-account gives 'Connection closed by foo'. I'm fairly certain I tried every combination of required, sufficient, etc to the same effect. Only thing that made it work was putting in a module that returns success always, like pam_permit. I have to assume there's something really screwy with how something on your systems is set up or something about the too-complex PAM configuration isn't working properly, since this just works out of the box with me with supposedly the same versions of everything. Well, one system was installed with lenny to begin with, and for over a year we would have to do GSSAPIAuthentication=no to login to it, and the other 2 systems were upgraded to lenny, which subsequently broke gssapi logins in the same manner. Putting in pam_permit on all 3 systems fixed them. Doesn't matter to us so much now, though. At least 2 of these systems will be reinstalled with RHEL6 when it comes out, and the third isn't used by anyone ssh'ing to it that can make use of gssapi, so... If you have any other things you want me to try, I will for the sake of fixing whatever the real problem is, but no other system has this problem, only these lone debian lenny systems. I should note that the 2 systems that were upgraded were working fine before they were bumped up to lenny. All 3 are running the same version of libpam-krb5, 3.11-4. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Openafs Client with pam krb5 and ldap
On 2010-10-01 at 10:04, Russ Allbery ( r...@stanford.edu ) said: Andy Cobaugh writes: Two, I'm guessing this is debian? No, it's not Debian, although the common-* stuff made it look that way. But that's the Red Hat pam_krb5. I've had issues making this work with GSSAPI on lenny, and have an account section like this: account sufficient pam_permit.so debug account required pam_unix.so debug I spent a great deal of time fighting this when we upgraded the couple remaining debian machines here to lenny. windlord:~> cat /etc/pam.d/common-account # /etc/pam.d/common-account -- Authorization settings common to all services. account required pam_krb5.so account required pam_unix.so So I'd be very curious to hear more about what's breaking for you, since this should just work. (I'm the author of the pam-krb5 module used in Debian.) Sure, debian lenny. libpam-krb5 = 3.11-4 libpam-afs-session = 1.7-1 If I have just this in common-account: account required pam_unix.so debug Then I try to login with gssapi ssh: $ ssh foo Connection closed by x.x.x.x Only entry in auth.log: Oct 1 13:09:23 apollo sshd[25687]: Authorized to phalenor, krb5 principal phale...@bx.psu.edu (krb5_kuserok) So we add in pam_krb5 in common-account like this, which appears to be the default entry added when pam-krb5 is installed: account [default=ignore ignore=ignore success=ok] pam_krb5.so debug And same thing, connection closes. This is in auth.log: Oct 1 13:10:59 apollo sshd[25718]: Authorized to phalenor, krb5 principal phale...@bx.psu.edu (krb5_kuserok) Oct 1 13:10:59 apollo sshd[25718]: (pam_krb5): none: pam_sm_acct_mgmt: entry (0x0) Oct 1 13:10:59 apollo sshd[25718]: (pam_krb5): none: skipping non-Kerberos login Oct 1 13:10:59 apollo sshd[25718]: (pam_krb5): none: pam_sm_acct_mgmt: exit (failure) Only way to make it work is to add in pam_permit. I've done things like run sshd with the highest debugging level, among other things, and nothing I've done shows any indication that it's even failing. I think I deduced that the account routine wasn't returning success, so I tried pam_permit, it worked, and I stopped caring why. I've seen this on every single lenny system I've installed (so, maybe 3?). The defaults never worked for me. common-[auth|session|password] are all stock otherwise, and /etc/pam.d/sshd directly includes all of the common-* files. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] scripts to install openafs
On 2010-10-01 at 11:52, Jonathan S Billings ( jsbil...@umich.edu ) said: When installed, along with the "dkms" package (http://download.fedora.redhat.com/pub/epel/5/x86_64/repoview/dkms.html) and the GCC compiler, it will automatically compile and install a new openafs.ko every time you install a new kernel. ... or it may not. I've seen dkms fail to build openafs.ko too many times. After one too many users complained that they had no home directory and couldn't log in after turning their machine on, we gave in and switched to using the kmod-openafs package. Others may have other experiences with dkms, but it's certainly an option if it works for you. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Openafs Client with pam krb5 and ldap
On 2010-10-01 at 17:46, Claudio Prono ( claudio.pr...@atpss.net ) said: /etc/pam.d/common-account account requisite pam_unix2.so account required pam_krb5.so use_first_pass ignore_unknown_principals account sufficient pam_localuser.so account required pam_ldap.so use_first_pass One, if you're using LDAP for user/group info (as configured through nsswitch.conf), LDAP never plays into PAM, so you don't need pam_ldap anywhere. Two, I'm guessing this is debian? I've had issues making this work with GSSAPI on lenny, and have an account section like this: account sufficient pam_permit.so debug account required pam_unix.so debug I spent a great deal of time fighting this when we upgraded the couple remaining debian machines here to lenny. Others can most likely provide more help than that, just thought I'd mention the issue with the account section in case that ends up being a problem for you. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] scripts to install openafs
On 2010-10-01 at 08:39, David Bear ( david.b...@asu.edu ) said: It seems that once a year I end up getting a kernel update that breaks afs and then I need to install again... but with a fixed kmod -- or something else. I know a lot can and should be scripted -- but I've never taken the time. So I was hoping someone on the list may have a scripted install of afs for a Red Hat system -- actually, I run CentOS but they are the same enough that it shouldn't matter. Preferably, the script should look at the currently running kernel, then know enough to grab the latest stable openafs rpms and install them -- A followup script would be nice that would either update the kmod or the dkms afs package as well... Any pointers, code, advice would be helpful. Here's what we do, which isn't quite what you're asking for, but might give you some ideas. First, we use cfengine2 to handle all of our config management. We have everything set up in such a way that cf2 will do /everything/ necessary after the initial kickstart, where %post installs cfengine and pulls down a basic update.conf to bootstrap cf2. We publish our own yum repos. We set exclude=kernel* in the Base and Updates repos. We then place corresponding kmod-openafs and kernel packages in the repo. In this way, we can handle exactly when kernels are updated, and can be sure that we'll never be in a situation where there aren't kmod-openafs packages yet for new kernel packages. We've been in that situation before, it sucks ;) We then have something like this in cfengine:

classes:
  centos::
    # do we have openafs.ko for the running kernel?
    app_openafs_has_module = ( ReturnsZeroShell(/sbin/modinfo openafs >/dev/null 2>&1) )
    app_openafs_has_kmod_installed = ( ReturnsZeroShell(/bin/rpm -q kmod-openafs >/dev/null 2>&1) )

shellcommands:
  # dangerous: if kmod-openafs is installed but modinfo openafs returns nothing,
  # assume kmod-openafs was not installed for the currently running kernel and reboot
  # with the hopes of booting into a kernel with openafs.ko
  centos.app_openafs_has_kmod_installed.!app_openafs_has_module::
    "/sbin/shutdown -r +5 \"cfengine\: reboot in 5 minutes to try and fix openafs\"" ifelapsed=10 useshell=true

That just handles the state after initial install. There are other fairly standard entries in the editfiles, copy, and packages sections to make sure everything else (ThisCell, /etc/sysconfig/openafs, etc) is in the desired state. As far as rebooting machines to upgrade kernels, people reboot our workstations often enough that there's a pretty good chance they're running the latest installed kernel. Otherwise we know for sure when there's a new kernel to upgrade to, as we'll have dropped the appropriate packages in the yum repo, so we can do various things to reboot machines after they've done their nightly updates, assuming nobody is logged in. Hopefully that gives you /some/ idea of how one site handles openafs on centos. We've gotten away from using scripts to handle things in favour of doing things the cfengine way: defining the desired system state, and allowing the machines to converge on their own. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
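The repo side of that is only a few lines of yum config. A rough sketch (repo names and URLs are placeholders, not our real ones):

  # in the stock CentOS-Base.repo, for both [base] and [updates]
  exclude=kernel*

  # /etc/yum.repos.d/local.repo -- our own repo carrying matched
  # kernel and kmod-openafs packages
  [local]
  name=Local packages
  baseurl=http://yum.example.org/centos/$releasever/$basearch/
  enabled=1
  gpgcheck=0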
Re: [OpenAFS] Overview? Linux filesystem choices
On 2010-09-30 at 21:00, Robert Milkowski ( mi...@task.gda.pl ) said: On 30/09/2010 15:12, Andy Cobaugh wrote: I don't think anybody has mentioned the block level compression in ZFS yet. With simple lzjb compression (zfs set compression=on foo), our AFS home directories see ~1.75x compression. That's an extra 1-2TB of disk that we don't need to store. Of course that makes balancing vice partitions interesting when you can only see the compression ratio at the filesystem level and not the volume level. Checksums are nice too. There's no longer a question of whether your storage hardware wrote what you wanted it to write. This can go a long way to helping to predict failures if you run zpool scrub on a regular basis (otherwise, zfs only detects checksum mismatches upon read, scrub checks the whole pool). So, just to add us to the list, we're either ext3 on linux for small stuff (<10TB), or zfs on solaris for everything else. Will probably consider XFS in the future, however. Why not ZFS on Solaris x86 for "smaller stuff" as well? That's just the way things have worked out over the years. "smaller stuff" tends to be older machines that were here when I started, and a couple of those have hardware raid controllers (3ware PATA, for example) that will be decom'd soon. There are also cases where the machine with the storage attached to it also needs to be used interactively by people (like, a PI wants a new machine to run stuff on, but also wants 10TB, which we set up as a vice partition so they can access it from any machine). Solaris is great for storage if that's all you use it for, but anything else gets to be a pain when people start asking for really weird and complicated stuff to be installed. If I were doing everything over again, we would eliminate all of the storage islands, and run all the storage through solaris. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Overview? Linux filesystem choices
I don't think anybody has mentioned the block level compression in ZFS yet. With simple lzjb compression (zfs set compression=on foo), our AFS home directories see ~1.75x compression. That's an extra 1-2TB of disk that we don't need to store. Of course that makes balancing vice partitions interesting when you can only see the compression ratio at the filesystem level and not the volume level. Checksums are nice too. There's no longer a question of whether your storage hardware wrote what you wanted it to write. This can go a long way to helping to predict failures if you run zpool scrub on a regular basis (otherwise, zfs only detects checksum mismatches upon read, scrub checks the whole pool). So, just to add us to the list, we're either ext3 on linux for small stuff (<10TB), or zfs on solaris for everything else. Will probably consider XFS in the future, however. If you do use ext3, I find it helps sometimes to turn off atime. It might be interesting to see what other options, if any, other folks are using for ext3. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
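In case anyone wants to try this on their own vice partitions, it really is only a couple of commands; pool and dataset names below are made up:

  # enable lzjb compression on the dataset backing a vice partition
  zfs set compression=on tank/vicepa
  # see what you're actually getting back
  zfs get compressratio tank/vicepa
  # scrub the whole pool periodically (we do it from cron) so checksum
  # errors are found before you go to read the data
  zpool scrub tank
  zpool status tank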
Re: [OpenAFS] CellServDB
On 2010-06-18 at 20:08, Mattias Pantzare ( pant...@ludd.ltu.se ) said: But maybe the CellServeDB is not really the problem, the problem is that the client will list all sites in it by default. What if we just changed the default to not list sites other than the default site (that the installation program prompts for)? Or instead of asking OpenAFS to do this, you could do this yourself, using a configuration management system or series of shell scripts to make sure the CellServDB on all of your clients contains only entries you care about. If you're not already using such a system, perhaps you should be. Any problems beyond that are then a result of the software in question assuming everything under / is 'local' and 'fast'. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
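The client CellServDB format is simple enough that trimming it down this way is easy; a client that only needs to know about its own cell can get by with something like this in /usr/vice/etc/CellServDB (cell name and addresses here are made up):

  >example.org            #Example Org cell
  192.0.2.10              #afsdb1.example.org
  192.0.2.11              #afsdb2.example.org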
Re: [OpenAFS] Modifying the output of vos commands to include server UUIDs
On 2010-04-14 at 11:23, Jeffrey Altman ( jalt...@secure-endpoints.com ) said: On 4/14/2010 10:51 AM, Steve Simmons wrote: On Apr 13, 2010, at 5:28 PM, Jeffrey Altman wrote: I'm a long-time fan of having a switch that causes tools to dump their data in an easy-to-machine-parse format. That isn't always doable, but when it is, it's a big win. As Andrew pointed out in another reply in this thread, the -format switch is supposed to provide that but it fails to provide a consistent (value - data) pair per line. Exactly my point. We currently snarf off all that data nightly via a script that parses the output from vos e -format. It works but was a pain. Note, tho, that some data doesn't adapt well to single-line output. For example, just doesn't map well to single-line. We currently deal with this by creating four records, each with all the data from the rest of the output and the specifics of the four entries above. Steve Anyone want a -xml option? Yes, please. As much as I am not a fan of XML, it would make some of our lives easier for those of us using languages that include xml parsers. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Linux packages for 1.5?
On 2010-04-07 at 19:38, Simon Wilkinson ( s...@inf.ed.ac.uk ) said: Those of us actively developing on Linux have been running the 1.5 series for ages. The fact that other people are seeing problems would seem to indicate that testing across a wider variety of systems is required. Unfortunately, we don't have the time, or the systems, to do this by ourselves. If folk are interested in getting a stable 1.5 (and 1.6) for Linux any time this millennium, then we need more people testing the builds. This particularly applies to those running old, or non-standard kernels and running on odd platforms and architectures. One of the bugs I fixed for Russ surfaced exactly because he was running a kernel with slightly out of the ordinary memory management. If RPM packages would help with this please let me know. So far all I have heard is silence. Yes, please - if not for every 1.5.x release, at least for the ones you want tested. I'm probably in a position here where I could start pushing 1.5 onto certain desktops/workstations around here, and having ready-made RPMs for 1.5 will make that task that much easier. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Maximum size of AFS volume
On 2010-03-18 at 08:55, jonathan.whee...@stfc.ac.uk ( jonathan.whee...@stfc.ac.uk ) said: Can someone say what would be the maximum possible size of an AFS volume? Are there any limits imposed by AFS, or are the limits filesystem/partition dependent? To be more specific, I am talking about a volume of up to 1 TB (1000 GB). Biggest volume we have right now is ~3TB. I wouldn't recommend going much more than 1TB, though. The bigger you go, the longer it will take to do certain operations, like vos move. This volume in particular is slated to be split into smaller volumes in the not-so-distant future (the data doesn't easily lend itself to that without having mountpoints in weird places, for example, hg18 [human genome] is itself 2.2TB). As Derrick mentioned, quotas only work up to 2TB. Any more than that and you will have to disable the quota for the volume by setting it to 0. Also, certain tools may report odd numbers when you go beyond 2TB, mostly reporting large negative values - though this might be fixed in new enough versions, I can never keep track of this stuff. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
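For the record, moving a big volume is the same vos move as for a small one, it just takes a long time; server, partition, and volume names below are made up:

  vos move -id data.hg18 -fromserver afs1.example.org -frompartition /vicepa \
           -toserver afs2.example.org -topartition /vicepb -verbose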
Re: [OpenAFS] OS X 10.5 and kerberos ssh logins
On 2009-07-29 at 14:07, Adeyemi Adesanya ( y...@slac.stanford.edu ) said: Hi There. We've had a long standing issue with OS X 10.5 (Leopard) and I just wanted to check with folks to see if anyone has solved it. We are able to perform Kerberos SSH logins to 10.5 clients using the SSH GSSAPI options GSSAPIAuthentication and GSSAPIDelegateCredentials. As long as I have a valid kerberos ticket, I can log into my 10.5 systems without supplying a password. However, there does not appear to be any sign that the forwarded kerberos ticket is cached on the remote system. As a result, I cannot obtain an AFS token automatically. This was working for us under 10.4 but we have not found a solution for 10.5. Looks like the problem still exists for 10.6 too. Use the sshd from MacPorts. Apple's sshd is trying to use their credential caching mechanism, which would appear to store the credentials in your home directory, which if it's in AFS obviously won't work. Are you able to login at all _without_ GSSAPI, i.e. with a password? We're unable to, and that's the only major problem we're still seeing. Although come to think of it, this might be alleviated if we use Russ's pam_krb5, hmm... --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
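For anyone setting this up from scratch, the client-side options in question are just these two, either on the ssh command line or in ~/.ssh/config (the host pattern below is made up); the server's sshd_config needs GSSAPIAuthentication enabled as well:

  Host *.example.org
      GSSAPIAuthentication yes
      GSSAPIDelegateCredentials yes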
[OpenAFS] PAG garbage collection on linux
I recently ran into a situation where I had a process on linux that gets pags probably 10 times per minute (dovecot, fwiw). After 3 weeks of uptime after upgrading afs to 1.4.7, we had to reboot the machine, as we were no longer able to create new pags. This is on debian stable, kernel 2.6.18-6-amd64, and afs 1.4.7 as previously mentioned. /proc/sys/afs/GCPAGS gets set to 8, which from my understanding means it was unable to walk the process tree. I see that it works with 1.4.7 on fedora 9 with a 2.6.25 kernel. I have since reworked my dovecot setup so that imap logins don't needlessly get tokens and create a pag (the entire dovecot process stack already runs in a pag with tokens for a pts user with appropriate access). I am curious with what combinations of linux kernel / openafs PAG GC actually works. This would be helpful to know, and I'm sure others would benefit from this as well. Thanks. --andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] Summary of recommended configuration options from the workshop
Correct me if I'm wrong, but I seem to recall someone mentioning that there were certain cases when running fastrestart where volumes might end up being attached even if they need salvaging, leading to data loss/corruption? I would say any benefit you see in running fastrestart would be outweighed by the chance that you could lose entire volumes to such a bug. I say the sooner we can get DAFS / 1.5 stable the better. DAFS should make your fileservers restart real fast too. -- Andy ___ OpenAFS-info mailing list OpenAFS-info@openafs.org https://lists.openafs.org/mailman/listinfo/openafs-info