Re: [OpenAFS] replacement for depot?

2012-09-26 Thread Andy Cobaugh

On 2012-09-26 at 18:20, Jason Edgecombe ( ja...@rampaginggeek.com ) said:

Hi everyone,

I'm using a program called "depot", which I think used to be included with 
IBM/Transarc AFS. I'm planning to migrate from RHEL5 with cfengine to RHEL6 
with puppet. I use depot to manage many folders of symlinks. What would you 
recommend as a replacement for depot?


FYI, the fsi_generate command is used to generate a depot.image file that 
the depot command uses to build the symlink farm.


I've used GNU stow with some wrapper scripts for volume creation, 
replication, mounting, permissions, etc, to manage software installation 
into AFS.
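
The wrapper logic is roughly along these lines - this is just an illustrative 
sketch (cell, server, partition and volume names are all made up, and a real 
script does more error checking and releases the parent volume too):

PKG=foo-1.2                          # illustrative package name
CELL=example.edu                     # illustrative cell
VOL=sw.foo                           # illustrative volume name
SRV=fs1.$CELL                        # illustrative fileserver

vos create $SRV a $VOL
fs mkmount /afs/.$CELL/sw/stow/$PKG $VOL -rw
fs setacl /afs/.$CELL/sw/stow/$PKG system:anyuser rl
vos addsite $SRV a $VOL && vos release $VOL

# install into the volume, then let stow build the symlink farm one level up
./configure --prefix=/afs/.$CELL/sw/stow/$PKG && make && make install
cd /afs/.$CELL/sw/stow && stow $PKG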


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] sysname for 3.x linux kernel

2012-03-08 Thread Andy Cobaugh

On 2012-03-08 at 15:49, Dave Botsch ( bot...@cnf.cornell.edu ) said:

I just set the sysname to whatever I want it to be. Cfengine sticks a
"/usr/bin/fs sysname -newsys" command in the /etc/init.d/openafs-client
script in the startup section


Interesting. What are you setting the sysname to in that case? Anything 
cfengine-specific?
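
(For anyone else reading along, the mechanism being described is just a line 
like the following dropped into the start section of the init script - the 
sysname value here is purely illustrative:

/usr/bin/fs sysname -newsys amd64_linux26_custom

What I'm curious about is what value people actually put there.)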


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Feature wish: remove partition while fileserver keeps on running

2012-02-27 Thread Andy Cobaugh

On 2012-02-27 at 17:00, Lars Schimmer ( l.schim...@cgv.tugraz.at ) said:

Hi!

Maybe I missed a point or two, but I wish I could remove and unmount a 
/vicepX partition while fileserver keeps on running.


Over the last few weeks I needed to redo our iSCSI storage, which meant a lot of 
mounting/unmounting/redoing partitions on our OpenAFS fileservers.
Each time I needed to add/remove a partition, the safe way was to stop the 
OpenAFS fileserver, mount/umount the partition and restart the fileserver.
As I do not want to be the night owl, I did it during the usual work shift - which 
annoyed our users as service was broken for a few minutes.


Is there a better way to do this?

(IMHO DAFS is only for volumes, not partitions, or?)


That is correct.

However, DAFS can make the current methods of adding/removing /vicep's a 
bit less painful.


In the past, what I've typically done to remove partitions is completely 
evacuate them with vos remove/move etc, unmount the partition, then bos 
restart  dafs. Unmounting a partition while a fileserver might still 
be accessing it is risky, but if you're absolutely sure that there's 
nothing left on it, then this is more or less safe IMO. You could also bos 
shutdown / unmount / bos startup if you want to be paranoid.


Adding partitions is easier. Mount the new partition, then restart.
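
In other words, something along these lines (server, partition and volume 
names are illustrative):

# removing /vicepb from fs1: evacuate, unmount, restart
vos listvol fs1.example.edu b                         # check what's left
vos move home.someuser fs1.example.edu b fs2.example.edu a
vos remove fs1.example.edu b some.volume.readonly     # RO copies, if any
umount /vicepb
bos restart fs1.example.edu dafs -localauth

# adding /vicepc is just mount + restart
mount /dev/sdc1 /vicepc
bos restart fs1.example.edu dafs -localauth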

With DAFS, restart times are extremely fast, and I believe callback state 
is preserved across restarts, so your clients shouldn't notice the restart 
if everything is working correctly.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] improving cache partition performance

2011-08-29 Thread Andy Cobaugh

On 2011-08-29 at 19:39, Jason Edgecombe ( ja...@rampaginggeek.com ) said:
I was told that noatime is bad for an AFS cache partition because AFS uses 
the atime to know when the cache entry was last accessed.


Oops, looks like you're right, unless someone more knowledgeable says 
otherwise.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] improving cache partition performance

2011-08-29 Thread Andy Cobaugh


Just to add some more datapoints to the discussion...

Our webservers are HP DL360G5's, 14GB RAM. Pair of 36GB 15K 2.5" SAS 
drives in RAID-1. /usr/vice/cache is ext2 with noatime,nodiratime. These 
machines run dovecot IMAP, apache with lots of php applications, RT, and 
vsftpd serving anon and private ftp accounts.


Serving content that isn't in the cache yet, we can get about 70-80MB/s 
depending on which fileserver it's coming from, and after it's cached, the 
gigabit network becomes the bottleneck. The cache partition is ~34GB in 
size, and we're running with these options:


-dynroot -dynroot-sparse -fakestat -afsdb -nosettime -daemons 20
-stat 48000 -volumes 2048 -chunksize 19 -rxpck 2048
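
(With the RPM packaging, those typically live in the sysconfig file that the 
init script sources - roughly like the following; the exact variable name may 
differ between packagings:)

# /etc/sysconfig/openafs
AFSD_ARGS="-dynroot -dynroot-sparse -fakestat -afsdb -nosettime -daemons 20 -stat 48000 -volumes 2048 -chunksize 19 -rxpck 2048"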

With those cache manager settings, cache partition utilization is sitting 
at about 92%. I can get even better numbers with memcache, and indeed most 
of our other machines are running with 2GB of memcache. I like seeing read 
performance in GB/s, and when most of your machines have 32GB or more (we 
have 3 with 256GB), a couple GB here and there won't have a noticeable 
impact.


Jason: do you know in particular what kind of workload is causing issues 
for you? You mentioned your wait times are on the order of seconds; are 
you sure that's caused by the underlying disk? At the very least, I would 
try mounting your cache partition as ext2, as has already been suggested. 
Turning off atime and diratime shouldn't hurt, and if your disks are 
having issues with seeks, this should help some.
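
(A minimal fstab sketch of that suggestion - device path illustrative, and 
note the follow-up elsewhere in this thread about the cache manager using 
atime on cache files before actually dropping it:)

/dev/sda3  /usr/vice/cache  ext2  defaults,noatime,nodiratime  0 0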


Also, you really want to run 1.6.0pre7, or 1.6.0 when it shows up. Nothing 
wrong with 1.4, but if you're trying to get the most performance out of 
afs on modern hardware, switching to 1.6 gives you some real cheap gains. 
There are huge performance improvements on Linux going from 1.4 to 1.6, 
and all of my new installations are 1.6.0pre7 for that reason. Especially 
with disk-based caches, as Simon mentioned. 1.6.0pre7 gets write 
performance for disk caches almost on par with memcache, though read 
performance still lags: memory will almost always be faster than 
disk, but disk will always be 'cheaper' than memory.


Worth a try at least, and pre7 has been very stable in our environment.

--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: Solaris 10 deadlock issue

2011-06-17 Thread Andy Cobaugh

On 2011-06-17 at 12:07, Andrew Deason ( adea...@sinenomine.net ) said:

On Fri, 17 Jun 2011 13:01:33 -0400 (EDT)
Benjamin Kaduk  wrote:


This issue sounds rather similar (superficially, at least) to one
we've been seeing on FreeBSD clients.  When you say that "something
has changed ...", is that something you think is OS-specific AFS code,
OS code, or generic AFS code?


Something has changed in the Solaris kernel, since this problem does not
occur with earlier versions of Sol10 u8.


Can someone summarise which kernel versions / solaris updates and openafs 
versions are affected?


Is there any combination of the openafs client and u9 that works right 
now?


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] OpenAFS 1.6pre5 Ubuntu ppa

2011-06-14 Thread Andy Cobaugh

On 2011-06-14 at 14:00, Nicolas Bourbaki ( ncl.bourb...@gmail.com ) said:

Hi guys,

I'd like to know if the Ubuntu ppa repository has been upgraded to
offer the latest 1.6pre6 version of OpenAFS.
I'm having odd behavior when using the latest version available on the
following ppa:
 - http://ppa.launchpad.net/openafs/master/ubuntu/pool/main/o/openafs/
 - openafs-client_1.6.0~pre5-2~ppa0~maverick1_i386.deb

When opening my desktop session (user dirs on AFS), I have the following:
afs: Lost contact with file server XXX.XXX.XXX.XXX in cell yyy.yyy
afs: Lost contact with file server XXX.XXX.XXX.XXX in cell yyy.yyy
afs: failed to store file (110)
afs: failed to store file (110)
afs: failed to store file (110)
afs: file server XXX.XXX.XXX.XXX in cell yyy.yyy is back up
afs: file server XXX.XXX.XXX.XXX in cell yyy.yyy is back up


In case it's related, one of our users on debian wheezy was experiencing 
the same symptoms while trying to 'hg pull'. Problem didn't show up until 
the machine had been up for about 2 weeks, and went away after a reboot. 
Was going to wait and see if it shows up again before reporting it.


openafs-client-1.6.0~pre5-2
2.6.38-2-amd64

fstrace is here:
/afs/bx.psu.edu/user/phalenor/public/fstrace_dump.txt

If this isn't related to the problem above (not sure how close Debian and 
Ubuntu are wrt their openafs packaging), I'll send a real bug report if/when my problem shows 
up again. But in case it is related...


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: Expected performance

2011-05-20 Thread Andy Cobaugh

On 2011-05-20 at 14:10, Andrew Deason ( adea...@sinenomine.net ) said:

On Fri, 20 May 2011 13:51:05 -0400 (EDT)
Andy Cobaugh  wrote:


...or it's that you're writing to the same disk twice as much. If the
cache and /vicepX are on the same disk, it seems pretty intuitive that
it's going to be slower.


It was with memcache.


Well _I_ wasn't talking about memcache. :)


Of course. When I'm talking about performance, I'm almost never talking 
about disk cache ;)



mal and badger are slightly different hardware, but the tests above
show that we get very similar performance between all server and
client combinations except the case where client and server are on the
same machine.

Maybe I'm missing something here?


I think to some extent it can still be that they're just using the same
hardware resources, so some performance loss is to be expected (if the
network wasn't the bottleneck for separate machines). I'm not sure if
that can explain that degree of difference, though. I believe Rx in the
past has had some odd behavior with really low RTT, but any known fixes
there should have been in 1.6 for a while.

I expect a similar thing happens on 1.4? Though of course the baseline
performance is probably different, so it's not really comparing the same
thing.


I think there were differences with 1.4, but it's been a while since that 
particular machine ran 1.4, so I don't remember exactly.


One would think that a modern 8-core box with 32GB of memory would provide 
for enough 'isolation' between server and client. Maybe I just don't know 
enough about what resources are involved in that case.


I'm still curious what's actually causing that much of a performance loss.

--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: Expected performance

2011-05-20 Thread Andy Cobaugh

On 2011-05-19 at 14:19, Andrew Deason ( adea...@sinenomine.net ) said:

On Thu, 19 May 2011 14:57:16 -0400 (EDT)
Andy Cobaugh  wrote:


You can certainly get close if your disk for the disk cache is fast
enough. I've seen close to 80MB/s with 15K SAS under ideal conditions.

Re: client and server on the same machine - I've seen that actually
result in lower performance. When you take the physical network out of
the mix, Rx starts limiting you as a function of CPU usage it seems.


...or it's that you're writing to the same disk twice as much. If the
cache and /vicepX are on the same disk, it seems pretty intuitive that
it's going to be slower.


It was with memcache.

Just ran some quick tests yesterday to confirm what I saw before.

Here I have two different clients, 'mal' and 'badger'. badger has a 
fileserver. There is another fileserver, fs8, which serves the purpose of 
showing maximum client performance (fs8 is our biggest and fastest 
fileserver currently).


Clients on both mal and badger have essentially the same config, using a 
655360-block memcache (-memcache -blocks 655360).


iozone was used in all tests.

client -> server

mal -> fs8:
http://www.bx.psu.edu/~phalenor/afs_performance_results/mal.bx.psu.edu-201105121302/

mal -> badger:
http://www.bx.psu.edu/~phalenor/afs_performance_results/mal.bx.psu.edu-201105191536/

badger -> fs8:
http://www.bx.psu.edu/~phalenor/afs_performance_results/badger.bx.psu.edu-201105191708/

badger -> badger:
http://www.bx.psu.edu/~phalenor/afs_performance_results/badger.bx.psu.edu-201105191618/

mal and badger are slightly different hardware, but the tests above show 
that we get very similar performance between all server and client 
combinations except the case where client and server are on the same 
machine.


Maybe I'm missing something here?

--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: Expected performance

2011-05-19 Thread Andy Cobaugh

On 2011-05-19 at 13:25, Andrew Deason ( adea...@sinenomine.net ) said:

On Tue, 17 May 2011 23:14:03 +0100
Hugo Monteiro  wrote:


- Low performance and high discrepancy between test results
Transfer rates (only a few) hardly touched 30MB/s between the server and
a client sitting on the same network, connected via GB ethernet. Most of
the times that transfer rate is around 20MB/s, falling down to 13 or
14MB/s in some cases.


The client and server configs would help. I'm not used to looking at
single-client performance, but... assuming you're using a disk cache,
keep in mind the data is written twice: once to the cache and once on
the server. So, especially when you're running the client and server on
the same machine, there's no way you're going to reach the theoretical
110M/s of the disk.


You can certainly get close if your disk for the disk cache is fast 
enough. I've seen close to 80MB/s with 15K SAS under ideal conditions.


Re: client and server on the same machine - I've seen that actually result 
in lower performance. When you take the physical network out of the mix, 
Rx starts limiting you as a function of CPU usage it seems.



You may want to see what you get with memcache (or if you want to try a
1.6 client, cache bypass) and a higher chunksize. Just running dd on a
box I have, running a 1.4 afsd with -memcache -chunksize 24 made it jump
from the low 20s to high 40s/low 50s (M/s), after starting with the
defaults for a 100M disk cache.


Just to add some more data points...

I recently saw peaks of 90M/s for memcache for single client writes. Reads 
from memcache can be as fast as your memory is, so upwards of a couple 
GB/s.


In general, 1.6 memcache > 1.4 memcache > 1.6 diskcache > 1.4 diskcache. 
1.6 disk cache uses a LOT less CPU than 1.4 disk cache, however. Nice for 
processes that need IO and CPU at the same time on a machine that might 
already be lacking CPU.


Options I used to get those numbers with 1.6.0pre5:

Client:
-dynroot -fakestat -afsdb -nosettime -stat 48000 -daemons 12 -volumes 512 
-memcache -blocks 655360 -chunksize 19

Server:
-p 128 -busyat 600 -rxpck 4096 -s 1 -l 1200 -cb 100 -b 240 -vc 1200 
-abortthreshold 0 -udpsize 1048576

Server in this case is a very new 16-core Opteron box with 32GB of RAM (it 
runs multiple fileserver instances under Solaris zones). Client is a 
relatively new 8-core Opteron box with 64GB of memory.


Also in general, client performance seems to get worse the more CPUs you 
have. Our 48-core boxes tend to get lower numbers than our smaller 16 and 
8 core boxes. I haven't done too many comparison tests to really quantify 
how much of a difference that makes, though.


Cache bypass definitely makes things faster for things that aren't cached, 
though I will withhold performance numbers for that, as I was testing bypass 
inside an ESX VM (one of our webservers). Within the same machine, it 
got similar numbers to disk cache after the files had been cached (where 
the disk cache is a raw FC LUN).


Under normal conditions with fairly modern hardware, you 
should expect 50M/s with some simple tuning (-chunksize mostly, and 
-memcache if your machine has the memory to spare).


I haven't done any testing for the multi-client case, as that's slightly 
more difficult to properly test while holding everything else constant. By 
multi-client, I mean multiple actual cache managers involved as well as 
multiple users behind the same cache manager.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] When to publish security advisories?

2011-04-15 Thread Andy Cobaugh

On 2011-04-15 at 16:46, Russ Allbery ( r...@stanford.edu ) said:

Patricia O'Reilly  writes:


Is there any problem connecting 1.6 clients with 1.4.14 servers?


Nope.  Works fine.  Overall, 1.6 clients seem to be working as well or
better than 1.4 clients, although someone has reported reproducible hangs
and crashes to me with 1.6 (and I've been trying to get him to file a bug
report).  But I don't know of anyone else having that trouble.


Any day now we'll be pushing 1.6.0pre4 out to all of our Linux clients. 
Some of them have been running various versions of 1.5 for many many 
months now (web servers and the like). My testing shows 1.6 is noticeably 
faster than 1.4, with both disk cache and memcache. Most of our servers are still 
1.4.x.


As far as my site is concerned, we're considering the 1.6.0pre4 client 
stable on Linux.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: Reporting on some recent benchmark results

2011-04-06 Thread Andy Cobaugh

On 2011-04-06 at 14:41, Andrew Deason ( adea...@sinenomine.net ) said:

On Wed, 6 Apr 2011 15:29:58 -0400 (EDT)
Andy Cobaugh  wrote:


No; I didn't even think that was ours to handle. So, if you stop the
client, the AFS 'mount' entry stays there? I assume the multiple AFS
lines are identical? What kernel?


Lines are identical, as such:

AFS on /afs/ type afs (rw)


Somewhere you are specifying /afs/ as the AFS mountpoint, instead of
/afs (such as in /usr/vice/etc/cacheinfo, or afsd args). If you change
it to /afs, this appears to go away.


Yep, in cacheinfo.
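
For anyone following along, cacheinfo is a single mountpoint:cachedir:blocks 
line, so the fix was just dropping the trailing slash - i.e. changing

/afs/:/usr/vice/cache:1000000

to

/afs:/usr/vice/cache:1000000

(cache size here is illustrative).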

--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: Reporting on some recent benchmark results

2011-04-06 Thread Andy Cobaugh

On 2011-04-06 at 14:25, Andrew Deason ( adea...@sinenomine.net ) said:

On Wed, 6 Apr 2011 11:44:17 -0400 (EDT)
Andy Cobaugh  wrote:


One observation regarding 1.6.0pre4: In stop'ing and start'ing the
client via the init script, AFS shows up as being mounted several
times in the output of 'mount' - is that to be expected?


No; I didn't even think that was ours to handle. So, if you stop the
client, the AFS 'mount' entry stays there? I assume the multiple AFS
lines are identical? What kernel?


Lines are identical, as such:

AFS on /afs/ type afs (rw)

Running 2.6.18-194.32.1.el5

Interestingly, if I umount /afs/ after I shutdown the client, I get 
"umount: /afs/: not mounted", but the mount entries go away one at a time 
with each umount invocation.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Reporting on some recent benchmark results

2011-04-06 Thread Andy Cobaugh

On 2011-04-06 at 16:06, Simon Wilkinson ( s...@inf.ed.ac.uk ) said:


On 4 Apr 2011, at 22:18, Garrett Wollman wrote:


Over the past few days I have performed several benchmarks comparing
the performance of various OpenAFS server and client configurations.


Thanks for this - it makes for really interesting reading.

The statistic I'm really interested in at present, unfortunately, isn't 
one that you cover. With the imminent release of 1.6.0, what would be 
really interesting to know is a direct comparison between 1.4.14 and 
1.6.0 on the same hardware, for the same workload. I know of workloads 
in which I can clearly show that 1.6.0 is faster; what would be really 
useful to see, and to understand, is workloads for which it is 
slower.


All of my iozone tests are here:

http://www.bx.psu.edu/~phalenor/afs_performance_results/

For each run, you get the raw iozone output, and an 'info' file that 
collects information about the client: version, memory, afsd options, and 
location of the test volume.


I'm only interested in single-client, single-thread performance - when 
your users are dealing with files 10s and 100s of GB in size, that's all 
you really care about.


The recent tests on the 'c2' machine are my attempt to decide whether to 
deploy 1.6.0pre4 on all of our clients in place of 1.4.14.


1.6.0pre4 with memcache is looking very promising so far, easily capable 
of saturating a gigabit connection under the right conditions.


Of course, none of our tests are with encryption turned on. In our 
experience, it's far too easy for just a few clients to bring down even 
some of our fastest fileservers when they're all on gigabit.


One observation regarding 1.6.0pre4: In stop'ing and start'ing the client 
via the init script, AFS shows up as being mounted several times in the 
output of 'mount' - is that to be expected?


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] openafs 1.6.0pre4 and OSX 10.6.7 and 64bit kernel (NOT really) FIXED ;(

2011-03-29 Thread Andy Cobaugh

On 2011-03-29 at 20:10, Chris Jones ( christopher.rob.jo...@cern.ch ) said:

Hi,


Chris-Jones-Macbook-Pro /Library/OpenAFS/Tools/bin > ./cmdebug localhost
Lock afs_discon_lock status: (none_waiting, 1 read_locks(pid:1133))
** Cache entry @ 0xd35161a0 for 0.1.16777996.1 [dynroot]
   locks: (none_waiting, write_locked(pid:1133 at:599))
 18 bytes  DV1  refcnt 0
   callback expires 0
   0 opens  0 writers
   mount point
   states (0x5), stat'd, read-only


and a slightly different one later on (whilst waiting to just cd into a 
directory under /afs/cern.ch)

Chris-Jones-Macbook-Pro /Library/OpenAFS/Tools/bin > ./cmdebug localhost
Lock afs_discon_lock status: (none_waiting, 1 read_locks(pid:1156))
** Cache entry @ 0xd35184b0 for 382.537112396.26.32 [cern.ch]
   locks: (none_waiting, write_locked(pid:1156 at:66))
  7 bytes  DV1  refcnt 0
   callback 263a6708expires 1301440202
   0 opens  0 writers
   normal file
   states (0x1), stat'd


fwiw, we started seeing this on Leopard as early as 1.5.77. I just now saw 
this on 1.6.0pre4 on Snow Leopard with the 32-bit kernel. It's also 
happened with 1.6.0pre2 on Leopard.


Sometimes it hangs for only a few minutes. Other times, it will hang for 
hours until someone reboots.


cmdebug always reports 1 or more read locks on afs_discon_lock, with a 
random pid.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

2011-03-07 Thread Andy Cobaugh

On 2011-03-07 at 11:27, Andrew Deason ( adea...@sinenomine.net ) said:

On Fri, 4 Mar 2011 16:23:34 -0500 (EST)
Andy Cobaugh  wrote:


Volume name in question is pub.m.rpmforge. The .backup volume in
particular. This volume was backup'd this morning at approx. 0005, with
this output from vos backup:

Failed to end the transaction on the rw volume 536873153
: server not responding promptly
Error in vos backup command.
: server not responding promptly


Does this happen often enough that you could tell me if a patch makes it
go away? I'd like to know if this fixes it (it'll apply to 1.6.0pre2
with a little fuzz):
<http://git.openafs.org/?p=openafs.git;a=commitdiff_plain;h=69077559a7fc5784445ed56a2bfd613a5bb4174b>


I'd like to wait for it to happen one more time before calling it a 
problem. I'll try that if it happens again. Given the frequency that this 
has happened, I wouldn't be surprised if it happens again before 
wednesday.
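
If/when it does, applying the patch should just be a matter of something like 
this against the source tree we build from (expect the fuzz Andrew mentioned):

cd openafs-1.6.0pre2
wget -O fix.patch 'http://git.openafs.org/?p=openafs.git;a=commitdiff_plain;h=69077559a7fc5784445ed56a2bfd613a5bb4174b'
patch -p1 < fix.patch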


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

2011-03-07 Thread Andy Cobaugh

On 2011-03-07 at 11:03, Andrew Deason ( adea...@sinenomine.net ) said:

On Fri, 4 Mar 2011 19:42:04 -0500 (EST)
Andy Cobaugh  wrote:


Tue Mar  1 00:02:12 2011 VReadVolumeDiskHeader: Couldn't open header for volume 
536871061 (errno 2)

means the volume doesn't exist. It's not that it's corrupt or
anything; the volume was completely deleted. (or something just
deleted the .vol header, but the other messages suggest it was
deleted normally)


What does 'deleted normally' mean in this context? Nothing touched the
volume since the previous night, where it created the .backup volume
just fine. Unfortunately, those logs have since rolled over, so I
don't have anything older than from when I restarted the fileserver at
16:12 on Mar 1.


Deleted normally as in, a 'vos remove' or 'vos zap'. The volume header
didn't exist, and we didn't encounter any extant files when recreating
the clone, suggesting that the backup clone was cleanly deleted before
we tried making a new one.


Nope, nothing like that, so it must have been deleted abnormally somehow. 
I'll keep an eye out for this next time.



Yes, when the volume got caught in that state, any access could have
triggered a salvage (since it was in a half-created state). So, an
examine or someone just trying to access 'yesterday' (or whatever you
call it) could have caused that.


daily_backup_snapshot

That was probably it.

--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

2011-03-04 Thread Andy Cobaugh

On 2011-03-04 at 16:30, Andrew Deason ( adea...@sinenomine.net ) said:

On Fri, 4 Mar 2011 17:20:34 -0500 (EST)
Andy Cobaugh  wrote:


The first issue you reported had problems much earlier before the
log messages you gave. Did anything happen to the backup volume
before that?  No messages referencing that volume id? Did you or
someone/thing else remove the backup clone or anything?


Nope. We don't even access the backup volume when doing the file-level
backups anymore.


Well, _something_ deleted it, unless it didn't exist before 1 mar 2011.
This message


It certainly did exist before that, and nothing I did and no part of our 
backup system would have deleted it.



Tue Mar  1 00:02:12 2011 VReadVolumeDiskHeader: Couldn't open header for volume 
536871061 (errno 2)

means the volume doesn't exist. It's not that it's corrupt or anything;
the volume was completely deleted. (or something just deleted the .vol
header, but the other messages suggest it was deleted normally)


What does 'deleted normally' mean in this context? Nothing touched the 
volume since the previous night, where it created the .backup volume just 
fine. Unfortunately, those logs have since rolled over, so I don't have 
anything older than from when I restarted the fileserver at 16:12 on Mar 
1.



Yes, the zaps were me trying to get the .backup into a usable state.
Though, the first string of salvages started in the middle of the
afternoon without any intervention - I think the event that caused
them is what's missing from the picture.


Well, do you have the messages from around then?


Ugh, no. Hopefully I will if it happens again.


I'm still a little hesitant to bos salvage that server - whole reason
we're trying to switch to DAFS is to avoid the multi-hour fileserver
outages.


Salvaging a single volume is the same as a demand-salvage; it is no
slower and no more impactful than an automatically-triggered one. But
you can manually trigger the salvage of a single volume group in cases
like this (e.g. when the fileserver refuses to because it's been
salvaged too many times).


Ok, I had to bos salvage the .backup volume directly with -forceDAFS. When 
this happened on my machine at home, it wasn't so easy. In 
that case, it was with an RO clone. I think I had to remsite, then remove 
or zap or some combination, along with manually deleting the .vol. I wish 
I had paid closer attention then.


I still have no idea what caused the volume to spontaneously need 
salvaging Tuesday afternoon. I did notice that until I fixed the BK 
volume, if I did a 'vos exam home.gsong.backup', that triggered a salvage.


Wish I had more to go on. I'll be working on standardizing our logging 
configuration across servers next week, logging via syslog, etc, so we 
don't lose valuable logs like this.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

2011-03-04 Thread Andy Cobaugh

On 2011-03-04 at 15:59, Andrew Deason ( adea...@sinenomine.net ) said:


What about the command immediately preceding this? Anything odd about
it; time it took to execute, or any warnings/errors/etc?


The commands before that all completed in 30 seconds or less. No messages 
other than that.



I'm not sure how related this is to the other issue I saw, where the
backup clone was left in a much worse state.


I don't think it is; that error above isn't even really much of a
problem; we just failed to end the transaction, but the transaction
is idle by that point and will be ended automatically after 5 minutes
(as you see in the VolserLog).

The first issue you reported had problems much earlier before the log
messages you gave. Did anything happen to the backup volume before that?
No messages referencing that volume id? Did you or someone/thing else
remove the backup clone or anything?


Nope. We don't even access the backup volume when doing the file-level 
backups anymore.



The first messages around Tue Mar  1 00:02:12 2011 look like what would
happen if you tried to recreate the BK after it was deleted with that
code (fixed in the patches I mentioned before). The subsequent salvages
are from an error to read some header data, which could be explained by
the attempted 'zap's and such, assuming those messages were during/after
you noticed the volume being inaccessible and tried forcefully deleting
it.


Yes, the zaps were me trying to get the .backup into a usable state. 
Though, the first string of salvages started in the middle of the 
afternoon without any intervention - I think the event that caused them 
is what's missing from the picture.


I'm still a little hesitant to bos salvage that server - whole reason 
we're trying to switch to DAFS is to avoid the multi-hour fileserver 
outages.


I'm going to take some time either later tonight, or early next week to go 
back through the logs and try to make more sense of them from a 
chronological standpoint, and see if there's anything I missed.


There's still a bug somewhere that causes a .backup volume to go off-line 
after being created. I have a test volume on one of the problem 
fileservers right now, that's been vos backup'd once a minute since 
yesterday without a problem. So, something else must have to happen to 
cause this, just not sure what.






--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

2011-03-04 Thread Andy Cobaugh


Ok, an update to the problem I alluded to this morning.

Volume name in question is pub.m.rpmforge. The .backup volume in 
particular. This volume was backup'd this morning at approx. 0005, with 
this output from vos backup:


Failed to end the transaction on the rw volume 536873153 
: server not responding promptly
Error in vos backup command. 
: server not responding promptly


That command returned in <5s. I then see this in VolserLog:

Fri Mar  4 00:05:15 2011 1 Volser: Clone: Recloning volume 536873153 to volume 536873155
Fri Mar  4 00:10:19 2011 trans 13950 on volume 536873153 has been idle for more than 300 seconds
Fri Mar  4 00:10:49 2011 trans 13950 on volume 536873153 has been idle for more than 330 seconds
Fri Mar  4 00:11:19 2011 trans 13950 on volume 536873153 has been idle for more than 360 seconds
Fri Mar  4 00:11:49 2011 trans 13950 on volume 536873153 has been idle for more than 390 seconds
Fri Mar  4 00:12:19 2011 trans 13950 on volume 536873153 has been idle for more than 420 seconds
Fri Mar  4 00:12:49 2011 trans 13950 on volume 536873153 has been idle for more than 450 seconds
Fri Mar  4 00:13:19 2011 trans 13950 on volume 536873153 has been idle for more than 480 seconds
Fri Mar  4 00:13:49 2011 trans 13950 on volume 536873153 has been idle for more than 510 seconds
Fri Mar  4 00:14:19 2011 trans 13950 on volume 536873153 has been idle for more than 540 seconds
Fri Mar  4 00:14:49 2011 trans 13950 on volume 536873153 has been idle for more than 570 seconds
Fri Mar  4 00:15:19 2011 trans 13950 on volume 536873153 has been idle for more than 600 seconds
Fri Mar  4 00:15:19 2011 trans 13950 on volume 536873153 has timed out

Nothing in any of the other log files, and nothing interesting in FileLog 
other than:


Mar  4 00:05:15 horvitz fileserver[2236]: VOffline: Volume 536873153 (pub.m.rpmforge) is now offline (A volume utility is running.)
Mar  4 00:05:15 horvitz fileserver[2236]: fssync: breaking all call backs for volume 536873155

(and then tsm goes to access the RW volume, at which point I guess it's 
brought back online)


Mar  4 01:00:31 horvitz fileserver[2236]: SAFS_FetchStatus,  Fid = 536873153.1.1, Host 128.118.200.6:7001, Id 117
Mar  4 01:00:31 horvitz fileserver[2236]: VOnline:  volume 536873153 (pub.m.rpmforge) attached and online

I noticed this when nagios reported that one of the volumes on this server 
was marked off-line.


Now, interestingly, I just ran another vos backup against the same volume:

$ vos backup pub.m.rpmforge
Created backup volume for pub.m.rpmforge

Fri Mar  4 16:05:14 2011 1 Volser: Clone: Recloning volume 536873153 to volume 536873155

The pub.m.rpmforge.backup is now on-line.

Subsequent backups seem to be fine.

I'm not sure how related this is to the other issue I saw, where the 
backup clone was left in a much worse state. The immediate effects of the 
vos backup are the same, but I'm still not sure what caused the demand 
salvage of the volume later during the day in that case. There, it 
was a home directory that was very much in use, and something 
triggered the salvage; there's just nothing in the logs to indicate why.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

2011-03-04 Thread Andy Cobaugh


FYI, I have another .backup volume that started having the same issues 
this morning on a different machine under 32-bit linux with 1.6.0pre2. 
I'll gather some more details later today.


This same fileserver had no issues running 1.5.77, 1.5.78, or 1.6.0pre1 
(well, other than vos backup not working at all with certain versions).


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

2011-03-03 Thread Andy Cobaugh

On 2011-03-03 at 11:05, Andrew Deason ( adea...@sinenomine.net ) said:

On Tue, 1 Mar 2011 22:23:34 -0600
Andrew Deason  wrote:


The problem with the recovery is (probably) that the salvager doesn't
properly inform the fileserver when it destroys a volume, so the
erroneous volume state prevents you from doing anything with the
volume after it's destroyed. I need to test that behavior out tomorrow
and see what happens.


This is what happens, and can be easily seen if you corrupt the header
for a clone, try to access it, and try to recreate it after the salvager
deletes it. Gerrit 4117-4120 have been submitted to fix this.


Excellent. So I guess the remaining question is: how did the header get 
corrupted in the first place?


I'll be sure to keep a closer eye on things next time I see this. I've 
seen this twice on two completely different systems (my home machine, and 
a production fileserver at work, both after upgrading to 1.6 [and I think 
they were both pre2]), so I'm sure I'll see it again, just a matter of 
time.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

2011-03-01 Thread Andy Cobaugh

On 2011-03-01 at 22:23, Andrew Deason ( adea...@sinenomine.net ) said:

On Tue, 1 Mar 2011 22:38:07 -0500 (EST)
Andy Cobaugh  wrote:


(and I think you meant dafssync-debug. I may not have mentioned that.)


fssync-debug should detect a DAFS fileserver and execute dafssync-debug
for you.


If I just do fssync-debug, it tells me this:

*** server asserted demand attach extensions. fssync-debug not built to
*** recognize those extensions. please recompile fssync-debug if you need
*** to dump dafs extended state


Have you done successful 'vos backup's of that volume after the
1.6.0pre2 upgrade? Or did you upgrade and it broke?


Oh yes, definitely. It was upgraded on Feb 19.


Hmm, well, I interpreted "turned debugging up" to mean "up all the way",
which actually probably isn't true. The messages I'm looking for are at
level 125, and there's a lot of them (they log every FSSYNC request and
response).


Yeah, only running at 5 right now.


If I look in FileLog.old (I restarted at some point to up the debug
level), I see these lines:


You can change that with SIGHUP/SIGTSTP (unless you're doing that for a
permanent change).


Is that to increase/decrease logging level, respectively?
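
(Either way, I assume poking the running fileserver is just a matter of 
something like the following - the pgrep pattern is illustrative, the binary 
name differs for DAFS:

kill -TSTP $(pgrep -x fileserver)
kill -HUP  $(pgrep -x fileserver)

just not sure yet which direction each one goes.)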


Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection 
(cnt=94804)
Tue Mar  1 16:11:34 2011 FSYNC_com:  read failed; dropping connection 
(cnt=94805)


There should be a SYNC_getCom right before these (though it probably
just says "error receiving command"). Just to be sure, there aren't any
processes dying/respawning in BosLog{,.old}, are there?


No processes dying, fortunately.


Failed to end the transaction on the rw volume 536871059
: server not responding promptly
Error in vos backup command.
: server not responding promptly


That's RX_CALL_TIMEOUT, which I'm not used to seeing on volserver
RPCs... Do you know how long it took to error out with that? If it takes
a while, a core of the volserver/fileserver while it's hanging would be
ideal. It might just be the fileserver trying to salvage the volume a
bunch of times or something, though, and that takes too long.


From the start of the vos backup command until it returned was 16s, according to our logs.

--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Re: 1.6.0pre2 - more vos issues, possible bug

2011-03-01 Thread Andy Cobaugh

On 2011-03-01 at 20:27, Andrew Deason ( adea...@sinenomine.net ) said:

On Tue, 1 Mar 2011 18:57:50 -0500 (EST)
Andy Cobaugh  wrote:


This has happened at least once at work on Solaris 10 x86 with a
.backup volume, as seen above, and at least once on one of my home
machines on 64bit linux with an RO clone.


The volume was probably deleted during the salvage (it was already gone
by the time of the 'zap -force'), but the fileserver still has the
volume in an 'error' state.

Could you

volinfo /vicepa 536871061
fssync-debug query 536871061


# volinfo /vicepcb 536871061
Inode 2305843649298038783: Good magic 78a1b2c5 and version 1
Inode 2305843649365147647: Good magic 99776655 and version 1
Inode 2305843649432256511: Good magic 88664433 and version 1
Inode 2305843641043648511: Good magic 99877712 and version 1
Volume header for volume 536871061 (home.gsong.backup)
stamp.magic = 78a1b2c5, stamp.version = 1
inUse = 0, inService = 0, blessed = 1, needsSalvaged = 0, dontSalvage = 0
type = 2 (backup), uniquifier = 6743359, needsCallback = 0, destroyMe = d3
id = 536871061, parentId = 536871059, cloneId = 0, backupId = 536871061, restoredFromId = 0
maxquota = 134217728, minquota = 0, maxfiles = 0, filecount = 221896, diskused = 56684296
creationDate = 1299021740 (2011/03/01.18:22:20), copyDate = 1299021740 (2011/03/01.18:22:20)
backupDate = 1299021740 (2011/03/01.18:22:20), expirationDate = 0 (1969/12/31.19:00:00)
accessDate = 1299021734 (2011/03/01.18:22:14), updateDate = 1299021636 (2011/03/01.18:20:36)
owner = 1045, accountNumber = 0
dayUse = 0; week = (0, 0, 0, 0, 0, 0, 0), dayUseDate = 0 (1969/12/31.19:00:00)

volUpdateCounter = 135816


(and I think you meant dafssync-debug. I may not have mentioned that.)

# dafssync-debug query 536871061
calling FSYNC_VolOp with command code 65543 (FSYNC_VOL_QUERY)
FSSYNC service returned 0 (SYNC_OK)
protocol header response code was 0 (SYNC_OK)
protocol reason code was 0 (SYNC_REASON_NONE)
volume = {
hashid  = 536871061
header  = 0
device  = 79
partition   = 102a75a8
linkHandle  = 0
nextVnodeUnique = 0
diskDataHandle  = 0
vnodeHashOffset = 79
shuttingDown= 0
goingOffline= 0
cacheCheck  = 0
nUsers  = 0
needsPutBack= 0
specialStatus   = 0
updateTime  = 0
vnodeIndex[vSmall] = {
handle   = 0
bitmap   = 0
bitmapSize   = 0
bitmapOffset = 0
}
vnodeIndex[vLarge] = {
handle   = 0
bitmap   = 0
bitmapSize   = 0
bitmapOffset = 0
}
updateTime  = 0
attach_state= VOL_STATE_ERROR
attach_flags= VOL_IN_HASH | VOL_ON_VBYP_LIST
nWaiters= 0
chainCacheCheck = 3
salvage = {
prio  = 0
reason= 0
requested = 0
scheduled = 0
}
stats = {
hash_lookups = {
hi = 0
lo = 155
}
hash_short_circuits = {
hi = 0
lo = 0
}
hdr_loads = {
hi = 0
lo = 0
}
hdr_gets = {
hi = 0
lo = 0
}
attaches = 0
soft_detaches= 0
salvages = 16
vol_ops  = 1
last_attach  = 0
last_get = 0
last_promote = 0
last_hdr_get = 0
last_hdr_load= 0
last_salvage = 1299019004
last_salvage_req = 1299018855
last_vol_op  = 1299018890
}
vlru = {
idx = 5 (VLRU_QUEUE_INVALID)
}
pending_vol_op  = 0
}


Do you want the .vol file for this volume?


on the fileserver?

I have an idea on why you can't get the volume usable again, but I have
no clue as to what the original inconsistency was that caused the first
salvage.


My suspicion is that a previous 'vos backup' left it in this state. The 
volume group hasn't been touched other than for backups in many months. 
I've never had a problem like this with that fileserver or volume until I 
upgraded from 1.4.11 to 1.6.0pre2.



I would have included more snippets from FileLog as well, but I have
the debug level turned up to try to track down a possible


Then you should have some logs mentioning 'FSYNC_com' around 'Tue Mar  1
00:02:25 2011' explaining why we refused to give out the volume.  (You
don't perhaps 

[OpenAFS] 1.6.0pre2 - more vos issues, possible bug

2011-03-01 Thread Andy Cobaugh


I have comments interspersed with log file snippets in a plain text file 
here:


http://users.bx.psu.edu/~phalenor/problem

I'm not sure what led to the initial problems with the .backup volume. We 
vos backup every volume every night.


This has happened at least once at work on Solaris 10 x86 with a .backup 
volume, as seen above, and at least once on one of my home machines on 
64bit linux with an RO clone.


I would have included more snippets from FileLog as well, but I have the 
debug level turned up to try to track down a possible authentication bug 
(where tokens no longer work against a 1.6.0pre2 fileserver, but are fine 
against other 1.4 fileservers - more on that after I've gathered more 
evidence).


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Revival: Recommended way to start up OpenAFS on Solaris 10?

2011-02-21 Thread Andy Cobaugh

On 2011-02-21 at 16:36, Jeff Blaine ( jbla...@kickflop.net ) said:

Best I can tell, the thread ended with this message from
David Boyes @ SNA:

http://www.openafs.org/pipermail/openafs-info/2010-January/032816.html

Anything?  Anyone?  Did we get anywhere?  Just looking to
snarf someone's SMF stuff that works.


https://github.com/phalenor/openafs-smf

I have a feeling those are less than correct, but it's a start. I've had 
issues with the server manifest a few times, when it comes to shutting 
down or restarting. Feel free to push any changes you end up making.
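
Rough idea of how they get used, in case it's not obvious (the FMRI names 
here are illustrative - check the manifests themselves for the real ones):

svccfg import openafs-client.xml
svccfg import openafs-server.xml
svcadm enable svc:/network/openafs/client
svcs -xv svc:/network/openafs/server     # useful when a stop/restart wedges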


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] pam_afs_session in Fedora?

2011-02-18 Thread Andy Cobaugh

On 2011-02-18 at 12:33, Ken Dreyer ( ktdre...@ktdreyer.com ) said:

On Fri, Feb 18, 2011 at 12:19 PM, Brandon S Allbery KF8NH
 wrote:

On 2/18/11 14:14 , Andy Cobaugh wrote:

Just curious why you're not just using the stock pam_krb5? At least in a
plain jane krb5 environment, pam_krb5 has worked fine for us (though I
haven't tried very recent Fedora).


There are programs which don't do PAM right; in particular, they run
pam_krb5 in root's context instead of the user's context, which worst-case
results in a UID-based (no PAG) root token and no user token.  This works
fine with krb5 if they do it right, but the token is a side effect that
can't be corrected in the session module.


Right, I want PAG support and the other benefits of pam_afs_session.

RedHat's pam_krb5's AFS support is not very good. In addition to not
granting PAGs, I've seen situations where it will check if AFS is
running, and if so, it attempts to convert the user's Kerberos 5
credential to a Kerberos 4 credential. This will time out because it
cannot find the Kerberos 4 KDCs (none exist). Logins were taking a
minute or more in these cases. Setting "ignore_afs" solved the
problem.


I can log in with pam_krb5, and I get put in a keyring-based PAG. I do see 
that the krb4_* options are no longer available in f14.


In any event, would definitely welcome pam_afs_session in EPEL, at least 
our PAM configurations would be somewhat similar across platforms.
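
For comparison, the pam_afs_session side is just a couple of lines in the 
stack, something like this (the exact stack and control flags obviously vary 
per distro and site):

# fragment of /etc/pam.d/system-auth (illustrative)
auth     optional  pam_afs_session.so
session  required  pam_afs_session.so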


--andy

Re: [OpenAFS] pam_afs_session in Fedora?

2011-02-18 Thread Andy Cobaugh

On 2011-02-18 at 11:16, Ken Dreyer ( ktdre...@ktdreyer.com ) said:

I would like to try to get Russ's pam_afs_session into Fedora/EPEL.
Since OpenAFS itself is not permitted for inclusion (I think it's
because "no kernel modules"?), I'm hoping that there will still be
utility to at least having pam_afs_session available. It won't be
built with openafs-devel, but I don't think that's a problem, right?
I've tested building in mock without depending on AFS at all, and it
seems to work.


Just curious why you're not just using the stock pam_krb5? At least in a 
plain jane krb5 environment, pam_krb5 has worked fine for us (though I 
haven't tried very recent Fedora).


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


[OpenAFS] Re: [OpenAFS-announce] OpenAFS 1.6.0 release candidate 2 available

2011-02-15 Thread Andy Cobaugh


So far so good. Deployed 1.6.0pre2 on 64 and 32 bit CentOS. Clients with 
disk cache and memcache, as well as two DAFS fileservers.


We've been running a mixture of 1.5.77, 1.5.78, and 1.6.0pre1 for some 
time now, on OSX, Solaris SPARC and x86, and 32/64-bit Linux, as both DAFS 
fileservers and clients with only a few bugs here and there, all of which 
seem to have been fixed with 1.6.0pre2 (mostly around vos).


Ran some quick iozone tests on 1.6.0pre2 with client and server on the 
same machine over loopback on 64-bit CentOS. There don't appear to be any 
gross single-client performance regressions in that case.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] calculating memory

2011-01-28 Thread Andy Cobaugh

On 2011-01-28 at 22:38, Gary Gatling ( gsgat...@eos.ncsu.edu ) said:


I am going to use RHEL 6 for the fileserver. I have a test VM up and working 
with openafs 1.4.14 to start with. Seems to work ok with ext4. The version of 
VMware we are using is VMware ESX. We pay full price for that. I think we are 
slowly moving to version 4, but right now I think it's mostly 3. (We can use 
the vmxnet 2 NIC but not 3 on most boxes so far.)


Sounds similar to what we do, except swap RHEL for Solaris 10. You 
definitely want to use the latest vmxnet drivers you have available. This 
speeds the network up tremendously, or at the very least reduces CPU 
overhead. I think in Centos 5.5 64bit I couldn't get much more than 
~300Mbps with the virtualized nic, but can get at least 800Mbps with 
vmxnet3. Similar results under Solaris.


The only reason we use Solaris is for compression. With LZJB, we see almost 
2:1 compression on our home directories, which are currently using about 
4TB+ of our SAN storage, which really means we have closer to 6TB of 
actual home directories. LZJB uses hardly any CPU, and I'm sure in some 
cases it's faster to compress than to write to disk. Oh, and end-to-end 
checksums are a nice bonus too if you don't trust your underlying storage, 
even if it is fancy uber-expensive SAN storage (we don't do ZFS RAID, just 
zpools with a single vdev -> RAID5 LUN).
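
The ZFS side of that is just a couple of per-dataset settings, along these 
lines (pool and device names illustrative):

zpool create vicepa c2t0d0               # one pool per RAID5 LUN
zfs set mountpoint=/vicepa vicepa
zfs set compression=lzjb vicepa
zfs get compressratio vicepa             # where the ~2:1 figure comes from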


We currently run 3 such fileserver VMs on VMware ESXi 4.x on the same box, 
2 vCPUs each (fileserver will barely use 2 CPUs, so factor in that plus a 
CPU for volserver when doing vos moves). Each of those VMs has 2GB of 
memory assigned to it right now, and that seems to be enough even with ZFS 
in play. If I'm reading the output from ps correctly, one of our larger 
DAFS fileservers running on Centos 5.5 64bit is using 1.8GB, davolserver 
1.5GB. (That's with -p 128 to both commands, so actual memory usage is 
probably much smaller than that).


It seems like on Solaris 10 with openafs 1.4.11 the server seems to use about 
1 GB when its not backing up. I am not sure how much it uses at "peak times" 
or when doing full backups. And I don't have the new backup software (yet). 
Teradactyl is the backup software we are switching to to ditch Solaris for 
Linux.


Just to add another datapoint to the mix, we use TSM (provided by our 
university's central IT), and just do file-level backups. At least that 
way we're server agnostic (though it's not the fastest solution by a 
longshot - the TSM server is the bottleneck in our case, so there wasn't 
any point in choosing a faster backup strategy).


I'm curious - how are you backing up AFS now?

I gather real servers aren't an option `cause management really likes moving 
most everything into VMware. We already moved all our license and web servers 
into VMware and we have some other weird servers working in it also. Even 
Windows infrastructure like domain controllers and stuff. If everyone says it's 
a bad idea I can make an argument though. :)


Eh, if you push your data onto these virtualized servers and performance 
takes a hit (we'll sometimes see sporadic slowdowns when vos moves 
are happening on the same ESX host), then obviously you can try to take 
the "I told you so" approach and get some bare metal hardware to compare 
things to.


Oh, and we also do raw device maps in ESX. I haven't quantified how much 
faster raw device maps are over going through a regular virtual disk -> 
VMXFS -> SAN, but being able to access that LUN from a non-ESX box and see 
ext4 instead of VMXFS sounds like the makings of a good DR strategy.


One more thing: SAN raw device maps in ESX 4 are limited to 2TB. I guess 
the hypervisor is still using Linux 2.4, and there are some limitations 
from Linux itself in play there. You can create a VMXFS bigger than 2TB by 
using multiple extents (I think). iSCSI doesn't have this limitation. Just 
something to be aware of.


I would be very curious to see any benchmarks you come up with. Things 
like iozone on the vicep itself, iperf between VMs on the same vSwitch, 
between VMs on different hosts, etc.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] PTS membership (or existence) based on external data?

2011-01-21 Thread Andy Cobaugh

On 2011-01-21 at 11:36, Stephen Joyce ( step...@physics.unc.edu ) said:

Hello,

Has anyone written a script or utility to add/remove PTS entries (either 
membership in PTS groups or actual existence of the PTS user account would be 
acceptable) from an external database, based on date?


My AFS cell is in the middle of transitioning from authenticating against a 
departmental KRB5 realm to authenticating against a central University-wide 
KRB5 realm. I'd like to be able to continue to have the ability to expire 
students' access to resources automatically--when their affiliation with the 
Department expires: at the end of a semester, research project, etc.


So I thought I'd ask if anyone has an in-house tool, querying expiration 
dates from an external source such as a non-authoritative KDC, SQL, etc) and 
is willing to share, before I possibly reinvent the wheel.


This is what we use:

https://github.com/phalenor/ldap2pts

It's not perfect, is very specific to our site, has at least one bug that 
needs to be fixed (owner of user:group groups needs to match the 
username), screen scrapes all of the pts commands, is an example of some 
non-ideal Perl programming, and won't scale too well. We run it once every 
10 minutes, but we only have 259 accounts and 92 groups, so it may only 
take on the order of 30 seconds to run (on a SunFire V100). I wanted to 
add support for parsing the output of an openldap accesslog so it syncs in 
almost real-time and doesn't have to compare all of ldap against all of 
pts.
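
The core of it isn't much more than this kind of loop, just done in Perl and 
for every user and group (the schema, base DN and group name here are 
illustrative):

#!/bin/bash
# one-way sync of an LDAP posixGroup into a pts group (rough sketch)
GROUP=staff
BASE="ou=Groups,dc=example,dc=edu"

want=$(ldapsearch -x -LLL -b "$BASE" "(cn=$GROUP)" memberUid |
       awk '/^memberUid:/ {print $2}' | sort)
have=$(pts membership "$GROUP" | sed '1d; s/^ *//' | sort)

for u in $(comm -23 <(echo "$want") <(echo "$have")); do
    pts adduser -user "$u" -group "$GROUP"
done
for u in $(comm -13 <(echo "$want") <(echo "$have")); do
    pts removeuser -user "$u" -group "$GROUP"
done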


Anyway, might give you some different ideas.

--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] k5start, AFS and long-running daemon

2011-01-17 Thread Andy Cobaugh

On 2011-01-17 at 16:17, Stephen Quinney ( step...@jadevine.org.uk ) said:

I am having some problems with trying to use k5start to maintain a
kerberos credential cache for a long-running daemon. In particular,
it's maintaining the AFS tokens which is problematic.

I noticed on http://www.eyrie.org/~eagle/software/kstart/todo.html,
the following comment on the k5start todo list:

"Add a flag saying to start a command in a PAG and with tokens and
then keep running even if the command exits. This would be useful to
spawn a long-running daemon inside a PAG and then maintain its tokens,
even if k5start and the daemon then become detached and have to be
stopped separately."

I have a daemon which detaches but which needs to access AFS
directories. Running k5start in the background works great for
maintaining the kerberos cache (which is also needed for DB access)
it's just AFS which is causing problems. So this sounds like exactly
what I need to do, given that this isn't currently possible with
k5start can you suggest the best way to go about achieving the same
thing?


Just start the whole thing inside pagsh.

Then we use these options to k5start:

/usr/bin/k5start -b -K 10 -l 14d -p /var/run/$prog-k5start.pid -f $keytab -k 
$ccname -t $princ2

Where $keytab is obvious, $ccname = /tmp/krb5cc_k5start_wrapped-$prog, and
$princ2 = -U or $princ@$realm (depending on the k5start version).

That's taken almost directly from our k5start-wrapper script, which we 
use to wrap init scripts under /etc/init.d/. You create 
/etc/init.d/$prog-afs, set a couple of variables like $keytab, then source 
k5start-wrapper.
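
Stripped way down, the wrapper idea looks something like this (names and 
paths are illustrative, not our exact script):

#!/bin/bash
# /etc/init.d/myprog-afs -- rough sketch of the k5start-wrapper approach
prog=myprog
keytab=/etc/keytabs/$prog.keytab
princ=service/$prog@EXAMPLE.EDU
ccname=/tmp/krb5cc_k5start_wrapped-$prog

case "$1" in
  start)
    # new PAG, background k5start keeping tickets + tokens fresh,
    # then start the real daemon inside that same PAG
    pagsh -c "/usr/bin/k5start -b -t -K 10 -l 14d \
                -p /var/run/$prog-k5start.pid \
                -f $keytab -k $ccname $princ; \
              KRB5CCNAME=$ccname /etc/init.d/$prog start"
    ;;
  stop)
    /etc/init.d/$prog stop
    kill "$(cat /var/run/$prog-k5start.pid)"
    ;;
esac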


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] volume size

2011-01-14 Thread Andy Cobaugh

On 2011-01-14 at 11:43, Lewis, Dave ( le...@nki.rfmh.org ) said:

Hi,

I'm wondering what is a reasonable size for large AFS volumes. I
understand that the maximum size of a volume is about 2 TB (assuming
that the partition is at least that size). From a practical standpoint,
is it reasonable to have a 2 TB volume? Should I expect any problems
doing operations like bos salvage or vos move on large volumes?


We've been running with some data volumes in the TB range for a while now 
without problems. Biggest volume right now is ~3.2TB. Splitting these 
large volumes isn't very practical.


vos move will seem to take forever, but we've moved TB scale volumes 
without any problems.


You'll find, however, that around 2TB, some tools will start to report 
negative numbers for volume size, and you can't set a quota bigger than 
2TB, so you get to set the quota to 0, disabling it entirely.
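
i.e. for anything that needs to grow past that, it ends up being (path 
illustrative):

fs setquota /afs/example.edu/data/bigvol -max 0    # 0 = no quota
fs listquota /afs/example.edu/data/bigvol          # sanity check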



For example, I'm wondering if bos salvage has a "harder" time with a few
large volumes than with several smaller volumes. I figure that, with
smaller volumes, internal inconsistencies that bos salvage fixes would
be more isolated than with large volumes, and that that would be
beneficial. But I don't really know.


salvages will take longer certainly, but it hasn't had any problems in my 
experience.



Currently we mount 25 GB volumes in users' home directories for their
image data, which grows a lot during data processing. Some users are
starting to feel limited by 25 GB volumes, so I'm considering going to
100 GB volumes. I would appreciate any advice. Should large volumes be
salvaged more often than small volumes?


Gee, how big are their homedirs then ;) Ours start out with a 50GB quota, 
with some hovering somewhere between 100-200GB.


Our rule for allocating space is, don't give them too much right away, or 
they'll use it up in no time. When most people hit their quota, they 
naturally find ways to work within their quota without having to ask for 
more space ;)


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


[OpenAFS] Re: [OpenAFS-announce] OpenAFS 1.6.0 release candidate 1 available

2011-01-06 Thread Andy Cobaugh

On 2011-01-06 at 16:55, Derrick J Brashear ( sha...@openafs.org ) said:


Please assist the gatekeepers by deploying this release and providing 
positive or negative feedback. Bug reports should be filed to
openafs-b...@openafs.org .  Reports of success should be sent to 
openafs-info@openafs.org .


This needs to get applied to 1.6:

http://git.openafs.org/?p=openafs.git;a=commit;h=97474963e58253f8c891e9f6596403213d53527b

--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Package Management in AFS

2010-12-20 Thread Andy Cobaugh

On 2010-12-20 at 19:34, Dirk Heinrichs ( dirk.heinri...@altum.de ) said:

Am 20.12.2010 19:26, schrieb Booker Bense:


My 2 cents... Outside of a few very specialized apps, putting software
in AFS is a losing proposition these days. Since local disk space is
growing so fast, there really is little justification for not simply
using the package management system
of the OS and simply installing locally.


Can't agree more. We use stow to install certain pieces of software into 
AFS, usually one-off and standalone scientific software (we're in 
bioinformatics).
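
The basic pattern looks something like this (the paths and sysname directory 
are made up for illustration):

./configure --prefix=/afs/.your.cell/sw/amd64_linux26/stow/foo-1.2
make && make install
cd /afs/.your.cell/sw/amd64_linux26/stow
stow foo-1.2    # populates ../bin, ../lib, etc. with symlinks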


For everything else, we use the package manager. RPMs really are easy to 
make. Perhaps even easier than installing the same app in AFS. Even if 
there was something like rpm for afs, that would only make the two methods 
(installing on local disk or installing in afs) equivalent (ignoring any 
issues of permission). This also assumes you're running the same version 
of the same OS everywhere (for example, we use @sys symlinks, but in our 
environment amd64_linux26 isn't the same everywhere).


Follow the principle of least work: is it more work to install an app into 
AFS, or to just yum/apt-get install it locally?


That would again mean that the sw had to be installed over and over
again, on every single machine. That may be OK for 2 or 5 machines, but
for a larger number this becomes a tedious task. And what about diskless
clients?


That's what cfengine or puppet are for. IMO, any time you have to manage 2 
or more machines, you really do need something like cfengine to do 
complete configuration. If you can't blow away entire machines and have 
them automatically reinstall and converge back to their previous state, 
then you're really not managing your systems.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] missing /etc/sysconfig/openafs-client

2010-10-18 Thread Andy Cobaugh

On 2010-10-18 at 13:15, David Bear ( david.b...@asu.edu ) said:

Indeed there is a /usr/vice/etc/cacheinfo.

what concerns me as that this set of rpm's has different configuration files
than the 1.4.10 rpms that I used to use.  The /etc/init.d/openafs-client
file sources /etc/sysconfig/openafs instead of
/etc/sysconfig/openafs-client.


Really? I just pulled down the openafs-client 1.4.10 RPM for RHEL-5 from 
dl.openafs.org, and the /etc/init.d/openafs-client that's included sources
/etc/sysconfig/openafs, and uses the AFSD_ARGS variable for options to 
afsd (things like -memcache, -daemons, -dynroot, -afsdb, etc etc).
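
So on a stock install you'd expect to see something along the lines of (the 
flag values here are just an example, not a recommendation):

# /etc/sysconfig/openafs
AFSD_ARGS="-dynroot -afsdb -daemons 6"

and the init script simply passes $AFSD_ARGS to afsd at startup.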


In fact, if we look at the git history for 
src/packaging/RedHat/openafs-client.init, it seems like it has always 
sourced /etc/sysconfig/openafs, which is installed by the openafs package 
proper.


Was the 1.4.10 you were running before downloaded from openafs.org 
originally?


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] forcing and update with yum for openafs

2010-10-14 Thread Andy Cobaugh

On 2010-10-14 at 11:08, David Bear ( david.b...@asu.edu ) said:

I removed openafs 1.4.10 -- and installed the 1.4.12 repository rpm.
However, when I try to do a yum install openafs it still wants to grab the
1.4.10 version. I look in /etc/yum.repos.d and see the openafs.repo file
there that points to 1.4.12 .. But I am at a loss as to how to force yum to
use it.


Could it be caching something somewhere? Try a 'yum clean all' ?

What happens if you say 'update' instead of 'install'? It could be that 
you didn't remove every openafs package, and a lingering package from 
1.4.10 is still specifying openafs = 1.4.10 as a dependency.
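
Something along these lines should show whether that's what's going on 
(standard commands, nothing exotic):

yum clean all
rpm -qa | grep -i openafs   # anything still at 1.4.10?
yum update 'openafs*'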


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Openafs Client with pam krb5 and ldap

2010-10-02 Thread Andy Cobaugh

On 2010-10-01 at 21:50, Russ Allbery ( r...@stanford.edu ) said:

Russ Allbery  writes:


Oh, I understand now.  pam_unix fails, and you were expecting pam_krb5
to return success (blindly) to counter pam_unix's failure, but since
pam_krb5 (correctly) returns PAM_IGNORE for users about which it has no
information, logins are failing because of the pam_unix failure.  Or, if
you remove pam_unix, because all modules in the stack returned
PAM_IGNORE.


Oh, and the other piece I forgot to mention: you saw this start happening
in lenny because in etch pam-krb5 did blindly return PAM_SUCCESS if the
user didn't log in with a password.  This was changed in 3.11:

   pam_setcred, pam_open_session, and pam_acct_mgmt now return PAM_IGNORE
   for ignored users or non-Kerberos logins rather than PAM_SUCCESS.
   This return code tells the PAM library to continue as if the module
   were not present in the configuration and allows sufficient to be
   meaningful for pam-krb5 in account and session groups.


Yeah, I think I remember reading that.

On redhat, account uses pam_unix, pam_krb5, then pam_permit after running 
authconfig and telling it to use ldap and /etc/passwd for authZ, and krb5 
and /etc/shadow for authN, so I think pam_permit may be the right way to 
go.


Thanks for clearing this up.

--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Openafs Client with pam krb5 and ldap

2010-10-01 Thread Andy Cobaugh

On 2010-10-01 at 20:30, Russ Allbery ( r...@stanford.edu ) said:


pam_permit of course fixes it because it basically disables the entire
account stack.  Just deleting everything out of the account stack would
presumably also fix it.


The account stack needs /something/ in it or it fails completely.


I wonder if pam_krb5 is a red herring here and what's actually failing is
pam_unix.  Do the accounts you're trying to log in as exist in
/etc/shadow?  Does it work if you remove pam_krb5 and only keep pam_unix?
pam_unix does require all accounts be present in /etc/shadow.


These accounts exist through ldap, so no entries in /etc/shadow.

It fails in the same manner with just pam_krb5.

pam_krb5 and pam_permit together work. Is your pam_krb5 returning nothing 
for pam_sm_acct_mgmt with gssapi ssh logins perhaps?


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Openafs Client with pam krb5 and ldap

2010-10-01 Thread Andy Cobaugh

On 2010-10-01 at 19:03, Russ Allbery ( r...@stanford.edu ) said:

account [default=ignore ignore=ignore success=ok]   pam_krb5.so debug


That doesn't look like anything that would ever be generated by default
and it isn't in the docs.  I wonder if that's causing your problem.  PAM
stacks can sometimes do really strange things if you set ignore as the
action and it's the last module in the stack.


Well, I didn't put it there, so something did. I've seen that line on 
systems that were originally lenny, and systems that were upgraded to 
lenny.



account requiredpam_unix.so
account requiredpam_krb5.so


Yep, putting exactly those lines in common-account gives 'Connection closed 
by foo'. I'm fairly certain I tried every combination of required, 
sufficient, etc. to the same effect. The only thing that made it work was 
putting in a module that always returns success, like pam_permit.



I have to assume there's something really screwy with how something on
your systems is set up or something about the too-complex PAM
configuration isn't working properly, since this just works out of the box
with me with supposedly the same versions of everything.


Well, one system was installed with lenny to begin with, and for over a 
year we would have to do GSSAPIAuthentication=no to login to it, and the 
other 2 systems were upgraded to lenny, which subsequently broke gssapi 
logins in the same manner. Putting in pam_permit on all 3 systems fixed 
them.


Doesn't matter to us so much now, though. At least 2 of these systems will 
be reinstalled with RHEL6 when it comes out, and the third isn't used by 
anyone ssh'ing to it that can make use of gssapi, so...


If you have any other things you want me to try, I will for the sake of 
fixing whatever the real problem is, but no other system has this problem, 
only these lone debian lenny systems.


I should note that the 2 systems that were upgraded were working fine 
before they were bumped up to lenny. All 3 are running the same version of 
libpam-krb5, 3.11-4.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Openafs Client with pam krb5 and ldap

2010-10-01 Thread Andy Cobaugh

On 2010-10-01 at 10:04, Russ Allbery ( r...@stanford.edu ) said:

Andy Cobaugh  writes:


Two, I'm guessing this is debian?


No, it's not Debian, although the common-* stuff made it look that way.
But that's the Red Hat pam_krb5.


I've had issues making this work with GSSAPI on lenny, and have an
account section like this:



account sufficient  pam_permit.so debug
account requiredpam_unix.so debug



I spent a great deal of time fighting this when we upgraded the couple
remaining debian machines here to lenny.


windlord:~> cat /etc/pam.d/common-account
# /etc/pam.d/common-account -- Authorization settings common to all services.

account required pam_krb5.so
account required pam_unix.so

So I'd be very curious to hear more about what's breaking for you, since
this should just work.  (I'm the author of the pam-krb5 module used in
Debian.)


Sure, debian lenny.

libpam-krb5 = 3.11-4
libpam-afs-session = 1.7-1

If I have just this in common-account:
account requiredpam_unix.so debug

Then I try to login with gssapi ssh:
$ ssh foo
Connection closed by x.x.x.x

Only entry in auth.log:
Oct  1 13:09:23 apollo sshd[25687]: Authorized to phalenor, krb5 principal phale...@bx.psu.edu (krb5_kuserok)

So we add pam_krb5 to common-account like this, which appears to be 
the default entry added when pam-krb5 is installed:

account [default=ignore ignore=ignore success=ok]   pam_krb5.so debug

And same thing, connection closes. This is in auth.log:
Oct  1 13:10:59 apollo sshd[25718]: Authorized to phalenor, krb5 principal phale...@bx.psu.edu (krb5_kuserok)
Oct  1 13:10:59 apollo sshd[25718]: (pam_krb5): none: pam_sm_acct_mgmt: entry (0x0)
Oct  1 13:10:59 apollo sshd[25718]: (pam_krb5): none: skipping non-Kerberos login
Oct  1 13:10:59 apollo sshd[25718]: (pam_krb5): none: pam_sm_acct_mgmt: exit (failure)

Only way to make it work is to add in pam_permit. I've done things like 
run sshd with the highest debugging level, among other things, and nothing 
I've done shows any indication that it's even failing. I think I deduced 
that the account routine wasn't returning success, so I tried pam_permit, 
it worked, and I stopped caring why.


I've seen this on every single lenny system I've installed (so, maybe 3?). 
The defaults never worked for me. common-[auth|session|password] are all 
stock otherwise, and /etc/pam.d/sshd directly includes all four common-* 
files.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] scripts to install openafs

2010-10-01 Thread Andy Cobaugh

On 2010-10-01 at 11:52, Jonathan S Billings ( jsbil...@umich.edu ) said:


When installed, along with the "dkms" package 
(http://download.fedora.redhat.com/pub/epel/5/x86_64/repoview/dkms.html) and 
the GCC compiler, it will automatically compile and install a new openafs.ko 
every time you install a new kernel.


... or it may not. I've seen dkms fail to build openafs.ko too many times. 
After so many users complaining that they didn't have a home directory 
after they turned their machine on and couldn't log in, we gave in and 
switched to using the kmod-openafs package.


Others may have other experiences with dkms, but it's certainly an option 
if it works for you.
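
A quick way to sanity-check a machine after a kernel update, whichever route 
you take (nothing dkms-specific here):

rpm -q kmod-openafs dkms openafs
modinfo openafs >/dev/null 2>&1 || echo "no openafs.ko for $(uname -r)"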


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Openafs Client with pam krb5 and ldap

2010-10-01 Thread Andy Cobaugh

On 2010-10-01 at 17:46, Claudio Prono ( claudio.pr...@atpss.net ) said:

/etc/pam.d/common-account

account requisite   pam_unix2.so
account requiredpam_krb5.so use_first_pass
ignore_unknown_principals
account sufficient  pam_localuser.so
account requiredpam_ldap.so use_first_pass


One, if you're using LDAP for user/group info (as configured through 
nsswitch.conf), LDAP never plays into PAM, so you don't need pam_ldap 
anywhere.


Two, I'm guessing this is debian? I've had issues making this work with 
GSSAPI on lenny, and have an account section like this:


account sufficient  pam_permit.so debug
account requiredpam_unix.so debug

I spent a great deal of time fighting this when we upgraded the couple 
remaining debian machines here to lenny.


Others can most likely provide more help than that, just thought I'd 
mention the issue with the account section in case it ends up being a 
problem for you.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] scripts to install openafs

2010-10-01 Thread Andy Cobaugh

On 2010-10-01 at 08:39, David Bear ( david.b...@asu.edu ) said:

It seems that once a year I end up getting a kernel update that breaks afs
and then I need to install again... but with a fixed kmod -- or something
else. I know a lot can and should be scripted -- but I've never taken the
time. So I was hoping someone on the list may have a scripted install of afs
for a Red Hat system -- actually, I run CentOS but they are the same enough
that it shouldn't matter. Preferably, the script should look at the
currently running kernel, then know enough to grab the latest stable openafs
rpms and install them -- A followup script would be nice the would either
update the kmod or the dkms afs package as well...  Any pointers, code,
adivce would be helpful.


Here's what we do, which isn't quite what you're asking for, but might 
give you some ideas.


First, We use cfengine2 to handle all of our config management. We have 
everything set up in such a way that cf2 will do /everything/ necessary 
after the initial kickstart, where %post installs cfengine and pulls down 
a basic update.conf to bootstrap cf2.


We publish our own yum repos. We exclude=kernel* in the Base and Updates 
repos. We then place corresponding kmod-openafs and kernel packages in the 
repo. In this way, we can handle exactly when kernels are updated, and can 
be sure that we'll never be in a situation where there aren't kmod-openafs 
packages yet for new kernel packages. We've been in that situation before, 
it sucks ;)


We then have something like this in cfengine:

classes:
  centos::
    # do we have openafs.ko for the running kernel?
    app_openafs_has_module = ( ReturnsZeroShell(/sbin/modinfo openafs >/dev/null 2>&1) )
    app_openafs_has_kmod_installed = ( ReturnsZeroShell(/bin/rpm -q kmod-openafs >/dev/null 2>&1) )

shellcommands:
  # dangerous: if kmod-openafs is installed but modinfo openafs returns nothing,
  # assume kmod-openafs was not installed for the currently running kernel and reboot
  # with the hopes of booting into a kernel with openafs.ko
  centos.app_openafs_has_kmod_installed.!app_openafs_has_module::
    "/sbin/shutdown -r +5 \"cfengine\: reboot in 5 minutes to try and fix openafs\"" ifelapsed=10 useshell=true

That just handles the state after initial install. There are other fairly 
standard entries in the editfiles, copy, and packages section to make sure 
everything else (ThisCell, /etc/sysconfig/openafs, etc) are in the desired 
state.


As far as rebooting machines to upgrade kernels, people reboot our 
workstations often enough that there's a pretty good chance they're 
running the latest installed kernel. Otherwise we know for sure when 
there's a new kernel to upgrade to, as we'll have dropped the appropriate 
packages in the yum repo, so we can do various things to reboot machines 
after they've done their nightly updates, assuming nobody is logged in.


Hopefully that gives you /some/ idea of how one site handles openafs on 
centos. We've gotten away from using scripts to handle things in favour of 
doing things the cfengine way: defining the desired system state, and 
allowing the machines to converge on their own.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Overview? Linux filesystem choices

2010-09-30 Thread Andy Cobaugh

On 2010-09-30 at 21:00, Robert Milkowski ( mi...@task.gda.pl ) said:

On 30/09/2010 15:12, Andy Cobaugh wrote:


 I don't think anybody has mentioned the block level compression in ZFS
 yet. With simple lzjb compression (zfs set compression=on foo), our AFS
 home directories see ~1.75x compression. That's an extra 1-2TB of disk
 that we don't need to store. Of course that makes balancing vice
 partitions interesting when you can only see the compression ratio at the
 filesystem level and not the volume level.

 Checksums are nice too. There's no longer a question of whether your
 storage hardware wrote what you wanted it to write. This can go a long way
 to helping to predict failures if you run zpool scrub on a regular basis
 (otherwise, zfs only detects checksum mismatches upon read, scrub checks
 the whole pool).

 So, just to add us to the list, we're either ext3 on linux for small stuff
 (<10TB), and zfs on solaris for everything else. Will probably consider
 XFS in the future, however.


Why not ZFS on Solaris x86 for "smaller stuff" as well?


That's just the way things have worked out over the years. "smaller stuff" 
tends to be older machines that were here when I started, and a couple of 
those have hardware raid controllers (3ware pata, for example) that will be 
decom'd soon. There are also cases where the machine with the storage 
attached to it also needs to be used interactively by people (say, a PI 
wants a new machine to run stuff on, but also wants 10TB, which we set up 
as a vice partition so they can access it from any machine).


Solaris is great for storage if that's all you use it for, but 
anything else gets to be a pain when people start asking for really weird 
and complicated stuff to be installed.


If I were doing everything over again, we would eliminate all of the 
storage islands, and run all the storage through solaris.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Overview? Linux filesystem choices

2010-09-30 Thread Andy Cobaugh


I don't think anybody has mentioned the block level compression in ZFS yet. 
With simple lzjb compression (zfs set compression=on foo), our AFS home 
directories see ~1.75x compression. That's an extra 1-2TB of disk that we don't 
need to store. Of course that makes balancing vice partitions interesting when 
you can only see the compression ratio at the filesystem level and not the 
volume level.


Checksums are nice too. There's no longer a question of whether your storage 
hardware wrote what you wanted it to write. This can go a long way to helping 
to predict failures if you run zpool scrub on a regular basis (otherwise, zfs 
only detects checksum mismatches upon read, scrub checks the whole pool).


So, just to add us to the list, we use ext3 on linux for small stuff (<10TB) 
and zfs on solaris for everything else. Will probably consider XFS in the 
future, however.


If you do use ext3, I find it helps sometimes to turn off atime. It might be 
interesting to see what other options, if any, other folks are using for ext3.
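
For completeness, that's just the usual mount option on the vice partitions, 
e.g. (device name made up):

/dev/sdb1   /vicepa   ext3   defaults,noatime   0   2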


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] CellServDB

2010-06-18 Thread Andy Cobaugh

On 2010-06-18 at 20:08, Mattias Pantzare ( pant...@ludd.ltu.se ) said:


But maybe the CellServeDB is not really the problem, the problem is
that the client will list all sites in it by default. What if we just
changed the default to not list sites other than the default site
(that the installation program prompts for)?


Or instead of asking OpenAFS to do this, you could do this yourself, using 
a configuration management system or series of shell scripts to make sure 
the CellServDB on all of your clients contains only entries you care 
about. If you're not already using such a system, perhaps you should be.
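
Even a dumb loop gets you most of the way there if you don't have real 
config management yet (host list and paths are made up):

for h in $(cat afs-clients.txt); do
    scp /srv/afs/CellServDB root@$h:/usr/vice/etc/CellServDB
done

Keep in mind the client only reads the file at startup, so you'd restart the 
client (or use fs newcell) for changes to take effect.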


Any problems beyond that are then a result of the software in question 
assuming everything under / is 'local' and 'fast'.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Modifying the output of vos commands to include server UUIDs

2010-04-14 Thread Andy Cobaugh

On 2010-04-14 at 11:23, Jeffrey Altman ( jalt...@secure-endpoints.com ) said:

On 4/14/2010 10:51 AM, Steve Simmons wrote:


On Apr 13, 2010, at 5:28 PM, Jeffrey Altman wrote:



I'm a long-time fan of having a switch that causes tools to dump their data in 
an easy-to-machine-parse format. That isn't always doable, but when it is, it's 
a big win.


As Andrew pointed out in another reply in this thread, the -format
switch is supposed to provide that but it fails to provide a consistent
(value - data) pair per line.


Exactly my point. We currently snarf off all that data nightly via script that 
parses the output from vos e -format. It works but was a pain.

Note, tho, that some data doesn't adapt well to single-line output. For example,

just doesn't map well to single-line. We currently deal with this by creating 
four records, each with all the data from the rest of the output and the 
specifics of the four entries above.

Steve


Anyone want a -xml option?


Yes, please. As much as I am not a fan of XML, it would make life easier 
for those of us using languages that include XML parsers.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Linux packages for 1.5?

2010-04-08 Thread Andy Cobaugh

On 2010-04-07 at 19:38, Simon Wilkinson ( s...@inf.ed.ac.uk ) said:


Those of us actively developing on Linux have been running the 1.5 series for 
ages. The fact that other people are seeing problems would seem to indicate 
that testing across a wider variety of systems is required. Unfortunately, we 
don't have the time, or the systems, to do this by ourselves. If folk are 
interested in getting a stable 1.5 (and 1.6) for Linux any time this 
millennium, then we need more people testing the builds.


This particularly applies to those running old, or non-standard kernels and 
running on odd platforms and architectures. One of the bugs I fixed for Russ 
surfaced exactly because he was running a kernel with slightly out of the 
ordinary memory management.


If RPM packages would help with this please let me know. So far all I have 
heard is silence.


Yes, please - if not for every 1.5.x release, at least for the ones you 
want tested. I'm probably in a position here where I could start pushing
1.5 onto certain desktops/workstations around here, and having ready-made 
RPMs for 1.5 will make that task that much easier.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Maximum size of AFS volume

2010-03-18 Thread Andy Cobaugh

On 2010-03-18 at 08:55, jonathan.whee...@stfc.ac.uk ( 
jonathan.whee...@stfc.ac.uk ) said:

Can someone say what would be maximum possible size of an AFS volume ?
Are there any limits imposed by AFS, or are the limits
filesystem/partition dependent ?  To be more specific, I am talking
about a volume of up to 1 Tb (1000 Gb).


Biggest volume we have right now is ~3TB. I wouldn't recommend going much 
more than 1TB, though. The bigger you go, the longer certain operations 
like vos move will take. This volume in particular is slated to be split 
into smaller volumes in the not-so-distant future (the data doesn't easily 
lend itself to that without having mountpoints in weird places; hg18 
[human genome], for example, is itself 2.2TB).


As Derrick mentioned, quotas only work up to 2TB. Any more than that and 
you will have to disable the quota for the volume by setting it to 0.


Also, certain tools may report odd numbers when you go beyond 2TB, mostly 
reporting large negative values - though this might be fixed in new enough 
versions, I can never keep track of this stuff.


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] OS X 10.5 and kerberos ssh logins

2009-07-29 Thread Andy Cobaugh

On 2009-07-29 at 14:07, Adeyemi Adesanya ( y...@slac.stanford.edu ) said:


Hi There.

We've had a long standing issue with OS X 10.5 (Leopard) and I just wanted to 
check with folks to see if anyone has solved it. We are able to perform 
Kerberos SSH logins to 10.5 clients using the SSH GSSAPI options 
GSSAPIAuthentication and GSSAPIDelegateCredentials. As long as I have a valid 
kerberos ticket, I can log into my 10.5 systems without supplying a password. 
However, there does not appear to be any sign that the forwarded kerberos 
ticket is cached on the remote system. As a result, I cannot obtain an AFS 
token automatically. This was working for us under 10.4 but we have not found 
a solution for 10.5. Looks like the problem still exists for 10.6 too.


Use the sshd from macports. Apple's sshd is trying to use their credential 
caching mechanism, which would appear to store the credentials in your 
home directory, which if it's in AFS obviously won't work.


Are you able to log in at all _without_ GSSAPI, i.e. with a password? 
We're unable to, and that's the only major problem we're still seeing. 
Although come to think of it, this might be alleviated if we use Russ's 
pam_krb5, hmm...


--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


[OpenAFS] PAG garbage collection on linux

2008-09-15 Thread Andy Cobaugh


I recently ran into a situation where I had a process on linux that gets 
pags probably 10 times per minute (dovecot, fwiw). After 3 weeks of uptime 
after upgrading afs to 1.4.7, we had to reboot the machine, as we were no 
longer able to create new pags.


This is on debian stable, kernel 2.6.18-6-amd64, and afs 1.4.7 as 
previously mentioned.


/proc/sys/afs/GCPAGS gets set to 8, which from my understanding means it 
was unable to walk the process tree.


I see that it works with 1.4.7 on fedora 9 with a 2.6.25 kernel.

I have since reworked my dovecot setup so that imap logins don't 
needlessly get tokens and create a pag (the entire dovecot process stack 
already runs in a pag with tokens for a pts user with appropriate access).


I am curious in which combinations of linux kernel / openafs PAG garbage 
collection actually works. This would be helpful to know, and I'm sure 
others would benefit from it as well.


Thanks.

--andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] Summary of recommended configuration options from the workshop

2008-05-29 Thread Andy Cobaugh


Correct me if I'm wrong, but I seem to recall someone mentioning that 
there were certain cases when running fastrestart where volumes might end 
up being attached even if they need salvaging, leading to data 
loss/corruption?


I would say any benefit you see from running fastrestart is outweighed by 
the chance that you could lose entire volumes to such a bug.


I say the sooner we can get DAFS / 1.5 stable the better. DAFS should make 
your fileservers restart real fast too.


--
Andy
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info