Re: [OpenAFS] connection timed out, how long is the timeout?
On Sun, Feb 04, 2018 at 05:21:16PM -0500, Jeffrey Altman wrote:
> On 2/4/2018 7:29 AM, Jose M Calhariz wrote:
> > I am chasing the root problem in my infra-structure of afsdb and
> > afs-fileservers. Sometimes my afsdb loses quorum in the middle of a
> > vos operation or the Linux clients time out talking to the
> > file servers. To help diagnose the problem I would like to know how
> > long is the timeout and if I can change the time out connections in
> > the Debian clients and for the vos operations.
> > [...]
> > The core of my infra-structure are 4 afsdb running Debian 9, and using
> > OpenAFS from Debian 1.6.20, on a shared virtualization platform. The
> > file-servers running Debian 9 and using OpenAFS from Debian, 1.6.20,
> > are VMs in dedicated hosts for OpenAFS on top of libvirt/KVM.
>
> Jose,
>
> (...)

Thank you for your report. I will read it very attentively tonight and again tomorrow. I am travelling home from FOSDEM.

> Jeffrey Altman
> AuriStor, Inc.

Kind regards
Jose M Calhariz

--
Of a hundred favorites of kings, ninety-five were hanged
--Napoleão Bonaparte
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info
Re: [OpenAFS] connection timed out, how long is the timeout?
On 2/4/2018 7:54 AM, Dirk Heinrichs wrote:
> On 04.02.2018 at 13:29, Jose M Calhariz wrote:
>
>> The core of my infra-structure are 4 afsdb
>
> Wasn't it so that it's better to have an odd number of DB servers (with
> a max. of 5)?

The maximum number of ubik servers in an AFS3 cell is 20. This is a protocol constraint. However, due to performance characteristics it is unlikely that anyone could run that number of servers in a production cell. As the server count increases, so does the number of messages that must be exchanged to conduct an election, complete database synchronization recovery, maintain quorum, and complete remote transactions. These messages compete with the application-level requests arriving from clients. As the application-level calls (vl, pt, ...) increase, the risk of delayed processing of disk and vote calls increases, which can lead to loss of quorum or remote transaction failures.

Odd numbers of servers are preferred because of the failover properties:

one server - single point of failure. An outage leads to read and write failures.

two servers - single point of failure for writes. Only the server with the lowest IPv4 address can be elected coordinator. If it fails, writes are blocked. If it fails during a write transaction, read transactions on the second server are blocked until the first server recovers.

three or four servers - either of the two servers with the lowest IPv4 addresses can be elected coordinator. Any one server can fail without loss of write or read.

five or six servers - any of the three servers with the lowest IPv4 addresses can be elected coordinator. Any two servers can fail without loss of write or read.

Although adding a fourth server increases the number of servers that can satisfy read requests, the lack of improved resiliency to failure and the increased risk of quorum loss make it less desirable.

The original poster indicated that his ubik servers are virtual machines. The OpenAFS Rx stack throughput is limited by the clock speed of a single processor core. The 1.6 ubik stack is further limited by the need to share a single processor core with all of the vote, disk and application call processing. As a result, anything that increases the overhead increases the risk of quorum failures. This includes virtualization as well as the overhead imposed by the Meltdown and Spectre fixes. Meltdown and Spectre can deliver a double whammy because of the increased overhead both within the virtual machine and within the host's virtualization layer.

AuriStor's UBIK variant does not suffer the scaling problems of AFS3 UBIK. AuriStor's UBIK has been successfully tested with 80 ubik servers in a cell. This is possible because of a more efficient protocol that is incompatible with AFS3 UBIK and because of the efficiencies in AuriStor's Rx implementation.

Jeffrey Altman
AuriStor, Inc.
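
The failover cases listed in the message above (one to six servers) can be summarized with two small formulas. The sketch below is only a fit to those six cases, and the function names are made up for illustration. Anything beyond six servers is an extrapolation that the message does not confirm.

```python
def coordinator_candidates(n_servers: int) -> int:
    """Number of lowest-IPv4 servers that can be elected coordinator.

    Fitted to the cases in the message: 1 -> 1, 2 -> 1, 3/4 -> 2, 5/6 -> 3.
    """
    if n_servers < 1:
        raise ValueError("a cell needs at least one ubik server")
    return (n_servers + 1) // 2


def tolerated_failures(n_servers: int) -> int:
    """Number of arbitrary servers that may fail without losing reads or writes.

    Fitted to the cases in the message: 1/2 -> 0, 3/4 -> 1, 5/6 -> 2.
    """
    if n_servers < 1:
        raise ValueError("a cell needs at least one ubik server")
    return (n_servers - 1) // 2


for n in range(1, 7):
    print(f"{n} servers: {coordinator_candidates(n)} coordinator candidate(s), "
          f"{tolerated_failures(n)} tolerated failure(s)")
```

Under this fit a fourth server tolerates no more failures than three do, which is the point made about even server counts.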
Re: [OpenAFS] connection timed out, how long is the timeout?
On 2/4/2018 7:29 AM, Jose M Calhariz wrote:
> I am chasing the root problem in my infra-structure of afsdb and
> afs-fileservers. Sometimes my afsdb loses quorum in the middle of a
> vos operation or the Linux clients time out talking to the
> file servers. To help diagnose the problem I would like to know how
> long is the timeout and if I can change the time out connections in
> the Debian clients and for the vos operations.
> [...]
> The core of my infra-structure are 4 afsdb running Debian 9, and using
> OpenAFS from Debian 1.6.20, on a shared virtualization platform. The
> file-servers running Debian 9 and using OpenAFS from Debian, 1.6.20,
> are VMs in dedicated hosts for OpenAFS on top of libvirt/KVM.

Jose,

There is unlikely to be a single problem, but since I'm procrastinating and curious I decided to perform some research on your cell. This research is the type of analysis that AuriStor performs on behalf of our support customers. Many of the problems you are experiencing with OpenAFS are likely due to or exacerbated by architectural limitations that are simply not present in AuriStorFS.

Your cell has four db servers, afs01 through afs04, with associated IP addresses that rank the servers from afs01 through afs04. Therefore afs01 is the preferred coordinator (sync site), and if it's not running, afs02 will be elected. Given there are four servers, it is not possible for afs03 or afs04 to be elected.

There are of course multiple independent ubik database services (vl, pt, and bu), and it is possible for quorum to exist for one and not for others.

The vl service is used to store volume location information as well as fileserver/volserver location information. vl entries are modified when a fileserver restarts, and when a vos command locks and unlocks an entry, or creates, updates or deletes an entry. Its primary consumer is the afs client, which queries volume and file server location information.

The pt service stores user and group entries. pt entries are modified by pts when user entries are created, modified or deleted; when groups are created, modified or deleted; or when group membership information is modified. The primary consumer is the fileserver, which queries the pt service for user and host current protection sets each time a client establishes an rxkad connection to the fileserver.

The vl and pt services are of course ubik services. Therefore each vlserver and ptserver process also offers the ubik disk and vote services, which are critical. The vote service is used to hold elections, distribute current database version info, and maintain quorum. The disk service is used to distribute the database, update the database, and maintain database consistency.

It should be noted that the vote service is time sensitive: the packets used to request votes from peers, and the responses to them, only have a limited valid lifetime.

Some statistics regarding your vl service. Each server is configured with 16 LWP threads. afs03 and afs04 have both failed to service calls in a timely fashion since the last restart. If those failures were vote or disk calls, then the coordinator would mark afs03 and afs04 as unreachable and force a recovery operation, and if both were marked down across an election this could result in loss of quorum.

Since the last restart afs01 has processed 1894352 vl transactions, afs02 1075698 transactions, afs03 2059186 transactions, and afs04 1403592 transactions. That will give you some idea of the load balancing across your cache managers. The coordinator of course is the only one to handle write transactions; the rest are read transactions.

For the pt service the transaction counts are afs01 1818212, afs02 1619962, afs03 1554918, and afs04 1075620. Roughly on par with the vl service load. Like the vl service, each server has 16 LWP threads. However, unlike the vl service, the pt service is not keeping up with the requests. Since the last restart all four servers have failed to service incoming calls in a timely manner thousands of times each.

The pt service failing to be responsive is a problem because it has ripple effects on the file servers. The longer it takes a fileserver to query the CPS data, the longer it takes to accept a new connection from a cache manager.

The ubik services in all versions of OpenAFS prior to the 1.8 branch have been built as LWP (cooperatively threaded) processes. There is only a single thread in the process that swaps context state. The rx threads (listener, event, ...) and the vote, disk, and application (vl, pt, bu, ...) contexts are swapped in either upon a blocking event or a yield. Failure of a context to yield blocks other activities, including reading packets, processing requests, etc.

Like AuriStorFS, the OpenAFS 1.8 series converts the ubik services (vl, pt, bu) to native threading. This will permit the vote and disk services and the rx threads (listener, event, ...) to operate with greater parallelism. Unlike
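
The vl and pt transaction counts quoted in the message above can be turned into per-server shares to make the load distribution easier to read. This is a minimal sketch using only the figures given in the message. The dictionary and function names are illustrative and not part of any OpenAFS tool.

```python
# Transaction counts since the last restart, copied from the message.
vl_transactions = {"afs01": 1894352, "afs02": 1075698,
                   "afs03": 2059186, "afs04": 1403592}
pt_transactions = {"afs01": 1818212, "afs02": 1619962,
                   "afs03": 1554918, "afs04": 1075620}


def shares(counts):
    """Return each server's fraction of the total transaction count."""
    total = sum(counts.values())
    return {host: count / total for host, count in counts.items()}


for service, counts in (("vl", vl_transactions), ("pt", pt_transactions)):
    print(f"{service}: {sum(counts.values())} transactions in total")
    for host, share in shares(counts).items():
        print(f"  {host}: {share:.1%}")
```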
Re: [OpenAFS] connection timed out, how long is the timeout?
On Sun, Feb 04, 2018 at 01:27:07PM -0600, Benjamin Kaduk wrote:
> On Sun, Feb 04, 2018 at 12:29:30PM +, Jose M Calhariz wrote:
> >
> > Hi,
> >
> > I am chasing the root problem in my infra-structure of afsdb and
> > afs-fileservers. Sometimes my afsdb loses quorum in the middle of a
>
> It is a pretty disruptive event to lose quorum; do you have any idea
> what might be responsible for that happening?

Recently I have twice seen a "vos release" of a critical volume fail. I may have misinterpreted the error message, so I paste the last one here:

Could not release lock on the VLDB entry for volume XXX
u: major synchronization error
Error in vos release command.
u: major synchronization error

> > vos operation or the Linux clients time out talking to the
> > file servers. To help diagnose the problem I would like to know how
> > long is the timeout and if I can change the time out connections in
> > the Debian clients and for the vos operations. My plan is to increase and
>
> The ubik election to determine quorum happens every SMALLTIME (60)
> seconds, but normally the current coordinator will retain that role
> and operations can span multiple election cycles.
>
> Most of the timeouts involved (e.g., RX_IDLE_DEAD_TIME and
> AFS_RXDEADTIME) are also on the order of a minute.
>
> I think you'd need to recompile in order to adjust these timeouts,
> though. And I really would recommend tracking down why you're
> losing quorum before trying to paper over things with longer
> timeouts.

I am also chasing a second problem, where a Debian OpenAFS client fails to communicate with the fileserver, and this problem is frequent. May I assume that this timeout is about 60 seconds, and that I need to recompile the client to increase or decrease it?

> -Ben
>
> > decrease the timeouts in OpenAFS and other timeouts in Linux to
> > identify if I have a possible problem with the data network, iSCSI
> > network, overload on the hosts of VM, overload on the file servers or
> > other possible problem.
> >
> > The core of my infra-structure are 4 afsdb running Debian 9, and using
> > OpenAFS from Debian 1.6.20, on a shared virtualization platform. The
> > file-servers running Debian 9 and using OpenAFS from Debian, 1.6.20,
> > are VMs in dedicated hosts for OpenAFS on top of libvirt/KVM.
> >
> > Kind regards
> > Jose M Calhariz

Kind regards
Jose M Calhariz

--
.adanibober odnes enilgaT .edraugA
Re: [OpenAFS] connection timed out, how long is the timeout?
On Sun, Feb 04, 2018 at 12:29:30PM +, Jose M Calhariz wrote:
>
> Hi,
>
> I am chasing the root problem in my infra-structure of afsdb and
> afs-fileservers. Sometimes my afsdb loses quorum in the middle of a

It is a pretty disruptive event to lose quorum; do you have any idea what might be responsible for that happening?

> vos operation or the Linux clients time out talking to the
> file servers. To help diagnose the problem I would like to know how
> long is the timeout and if I can change the time out connections in
> the Debian clients and for the vos operations. My plan is to increase and

The ubik election to determine quorum happens every SMALLTIME (60) seconds, but normally the current coordinator will retain that role and operations can span multiple election cycles.

Most of the timeouts involved (e.g., RX_IDLE_DEAD_TIME and AFS_RXDEADTIME) are also on the order of a minute.

I think you'd need to recompile in order to adjust these timeouts, though. And I really would recommend tracking down why you're losing quorum before trying to paper over things with longer timeouts.

-Ben

> decrease the timeouts in OpenAFS and other timeouts in Linux to
> identify if I have a possible problem with the data network, iSCSI
> network, overload on the hosts of VM, overload on the file servers or
> other possible problem.
>
> The core of my infra-structure are 4 afsdb running Debian 9, and using
> OpenAFS from Debian 1.6.20, on a shared virtualization platform. The
> file-servers running Debian 9 and using OpenAFS from Debian, 1.6.20,
> are VMs in dedicated hosts for OpenAFS on top of libvirt/KVM.
>
> Kind regards
> Jose M Calhariz
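
Since the timeouts mentioned above are compile-time values, one way to see what a client actually experiences is to time a single file system call and record which error, if any, it ends with. The sketch below is an assumption about how one might measure this from user space. It does not read or change any OpenAFS setting, and `time_access` is a made-up helper name.

```python
import errno
import os
import time


def time_access(path):
    """Time one stat() call and report how it ended.

    Returns (elapsed_seconds, error_name), where error_name is None on
    success or a symbolic errno such as "ETIMEDOUT" or "ENOENT".
    """
    start = time.monotonic()
    try:
        os.stat(path)
        error_name = None
    except OSError as exc:
        error_name = errno.errorcode.get(exc.errno, str(exc.errno))
    return time.monotonic() - start, error_name
```

Calling `time_access` on a path under /afs while the problem is occurring, and logging the elapsed time whenever the result is "ETIMEDOUT", would give a rough empirical figure to compare with the "on the order of a minute" estimate.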
Re: [OpenAFS] connection timed out, how long is the timeout?
On Sun, Feb 04, 2018 at 01:54:26PM +0100, Dirk Heinrichs wrote:
> On 04.02.2018 at 13:29, Jose M Calhariz wrote:
>
> > The core of my infra-structure are 4 afsdb
>
> Wasn't it so that it's better to have an odd number of DB servers (with
> a max. of 5)?

Yes, it would be better with an odd number. For historical reasons it is stuck at 4. But I think this is not the root cause of my problem.

> Bye...
>
> Dirk

Kind regards
Jose M Calhariz

--
Coca-Cola embodies the true beauty of capitalism. It is a kind of secular religion, with no moral teaching and no commandment other than increasing the consumption of its drink
--Mark Pendergrast
Re: [OpenAFS] connection timed out, how long is the timeout?
On 04.02.2018 at 13:29, Jose M Calhariz wrote:

> The core of my infra-structure are 4 afsdb

Wasn't it so that it's better to have an odd number of DB servers (with a max. of 5)?

Bye...

Dirk

--
Dirk Heinrichs
GPG Public Key: D01B367761B0F7CE6E6D81AAD5A2E54246986015
Secure Internet communication: http://www.retroshare.org
Privacy handbook: https://www.privacy-handbuch.de
[OpenAFS] connection timed out, how long is the timeout?
Hi,

I am chasing the root problem in my infra-structure of afsdb and afs-fileservers. Sometimes my afsdb loses quorum in the middle of a vos operation or the Linux clients time out talking to the file servers. To help diagnose the problem I would like to know how long is the timeout and if I can change the time out connections in the Debian clients and for the vos operations. My plan is to increase and decrease the timeouts in OpenAFS and other timeouts in Linux to identify if I have a possible problem with the data network, iSCSI network, overload on the hosts of VM, overload on the file servers or other possible problem.

The core of my infra-structure are 4 afsdb running Debian 9, and using OpenAFS from Debian 1.6.20, on a shared virtualization platform. The file-servers running Debian 9 and using OpenAFS from Debian, 1.6.20, are VMs in dedicated hosts for OpenAFS on top of libvirt/KVM.

Kind regards
Jose M Calhariz

--
Coca-Cola embodies the true beauty of capitalism. It is a kind of secular religion, with no moral teaching and no commandment other than increasing the consumption of its drink
--Mark Pendergrast
Re: [OpenAFS] Connection timed out on new mount point
On 02.12.2016 at 17:48, Jeffrey Altman wrote:

> The client has cached information for the volume group that indicates
> that no backup volume exists.
>
> fs checkvolumes

That solved it, indeed. Thanks a lot.

Bye...

Dirk

--
Dirk Heinrichs
GPG Public Key CB614542 | Jabber: dirk.heinri...@altum.de
Tox: he...@toxme.se
Secure Internet communication: http://www.retroshare.org
Privacy handbook: https://www.privacy-handbuch.de
Re: [OpenAFS] Connection timed out on new mount point
On 12/2/2016 11:35 AM, Dirk Heinrichs wrote:
> Hi,
>
> I'm currently facing a strange problem with connection timeouts after
> creating a mount point (fs mkm) for a new volume:
>
> # fs mkm tester home.tester.backup
> # ll
> ls: cannot access 'tester': Connection timed out
> total 132K
> ...
> ?? ? ? ? ?? tester
>
> The mount point has been created from a client workstation and only
> becomes available there after reboot or cache manager restart. OTOH,
> it's accessible immediately on the server (where /afs is usually not
> accessed):
>
> # ll
> total 134K
> ...
> drwx-- 2 1005 1001 2.0K Dec 1 21:49 tester
>
> Both server and client are up-to-date Debian Stretch systems running
> OpenAFS 1.6.18.3.
>
> Any ideas what could be causing the problem?
>
> Thanks...
>
> Dirk

The client has cached information for the volume group that indicates that no backup volume exists.

fs checkvolumes

Jeffrey Altman
[OpenAFS] Connection timed out on new mount point
Hi,

I'm currently facing a strange problem with connection timeouts after creating a mount point (fs mkm) for a new volume:

# fs mkm tester home.tester.backup
# ll
ls: cannot access 'tester': Connection timed out
total 132K
...
?? ? ? ? ?? tester

The mount point has been created from a client workstation and only becomes available there after reboot or cache manager restart. OTOH, it's accessible immediately on the server (where /afs is usually not accessed):

# ll
total 134K
...
drwx-- 2 1005 1001 2.0K Dec 1 21:49 tester

Both server and client are up-to-date Debian Stretch systems running OpenAFS 1.6.18.3.

Any ideas what could be causing the problem?

Thanks...

Dirk

--
Dirk Heinrichs
GPG Public Key CB614542 | Jabber: dirk.heinri...@altum.de
Tox: he...@toxme.se
Secure Internet communication: http://www.retroshare.org
Privacy handbook: https://www.privacy-handbuch.de
Re: [OpenAFS] Connection timed out - problem with cache manager?
I am not sure, since I don't know your kernel version. The cause may be the old AFS client module version. There has been a problem with the splice kernel function since kernel 4.4 and its backports. We use the OpenAFS PPA repository (https://launchpad.net/~openafs/+archive/ubuntu/stable) on Ubuntu releases below 16.10, because this problem is solved in openafs >= 1.6.18, which isn't part of the Ubuntu repositories for releases below 16.10.

I hope this helps you.

Regards,
Andreas

> Some users at our site reports problems with downloading files
> directly to AFS (and this problem has existed for years).
>
> [...]
>
> Downloading to local disk first, and then copy to AFS seems to work
> every time.
>
> Does anyone recognize this problem?
>
> /Staffan
[OpenAFS] Connection timed out - problem with cache manager?
Some users at our site report problems with downloading files directly to AFS (and this problem has existed for years).

I'm now working to find the cause. Just to eliminate the server, we have moved the user's volume to our YFS server, but we experience exactly the same problem.

I can't seem to reproduce it on my own machine (Ubuntu 14.04.1 LTS with openafs client 1.6.7-1ubuntu1.1).

However, the machine where I have managed to reproduce the problem is a terminal server (with lots of users). It's an Ubuntu 12.04.5 LTS with openafs version 1.6.1-1+ubuntu0.7.

The AFS cache is set to:

> cat /etc/openafs/cacheinfo
/afs:/cache/openafs:500

What happens is this: I run a wget (from siemens in this case, but probably not important). The wget either aborts at 70% or so, with a "Connection timed out", or, as happened for me just now:

HTTP request sent, awaiting response... 200 OK
Length: 1983588866 (1,8G) [application/zip]
Saving to: `nx-9.0.3.zip.1'

100%[>] 1 983 588 866 17,7M/s in 1m 50s

utime(nx-9.0.3.zip.1): Connection timed out
2016-11-30 11:33:39 (17,3 MB/s) - `nx-9.0.3.zip.1' saved [1983588866/1983588866]

So, the file downloaded 100% (to the AFS cache). Then there was a delay for some time before the error popped up (while flushing the cache, I would guess).

If I look at the resulting file, I see that it's corrupt.

Downloading to local disk first, and then copying to AFS, seems to work every time.

Does anyone recognize this problem?

/Staffan
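
The workaround described in the message above (download to local disk, then copy to AFS) can be paired with a checksum comparison so that a corrupt copy is noticed immediately. The following is a minimal sketch under that assumption. `sha256_of` and `copy_verified` are illustrative names, and reading the destination back may be served from the client cache, so a match is not proof that the data reached the file server.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(local_path, dest_path):
    """Copy a locally downloaded file to dest_path and compare checksums.

    Raises OSError if the copy fails or if the destination differs from
    the source after the copy.
    """
    expected = sha256_of(local_path)
    shutil.copyfile(local_path, dest_path)
    actual = sha256_of(dest_path)
    if actual != expected:
        raise OSError(f"checksum mismatch after copy: {dest_path}")
    return actual
```

A typical use would be to let wget save to a local scratch directory and then call `copy_verified` with the AFS path as the destination.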
[OpenAFS] Connection timed out and device doesn't exist finally solved
Very, very odd behavior. In short: an entire fileserver's RW volumes became unavailable to our colo sites, but not to the local site. Every effort to determine the cause was met with frustration (all sorts of cache manager operations yielded nothing).

That is, until I did an fs whereis on the affected volume, on the fileserver machine itself. It told me the RW volume was available on host 192.168.122.1, formerly a virtual host bridge interface but no longer used.

The VLDB did not show this. Running syncserv and syncvldb had not fixed the problem. Restarting the fileserver process did not release it, even though the IP was no longer active.

So I moved one volume. That worked. But I didn't want to do that for the entire fileserver, so I added -rxbind to the fileserver process and restarted it. Voila. Problem solved.

--
Timothy Balcer / IT Services
Telmate / San Francisco, CA
Direct / (415) 300-4313
Customer Service / (800) 205-5510
Re: [OpenAFS] Connection Timed Out errors occasionally when accessing openafs drive
I upgraded our server and client to 1.4.10. Unfortunately, I am still receiving Connection Timed Out errors. They rarely occur, but when they do they are a severe hindrance.

My use case is as follows: Three different unix user accounts (root, www-data, aux) are all running multiple background processes (~9 total) which access the afs mount. They each automatically acquire, or re-acquire, tickets and tokens, and then proceed to read, copy, and write files. Occasionally, upon creating a directory using a python os command similar to mkdir -p (os.makedirs), I receive a Connection Timed Out error. The processes must then be restarted.

Any other suggestions?

Ken

On Sun, May 10, 2009 at 7:41 PM, Derrick Brashear sha...@gmail.com wrote:
> it probably matters in the server here, but both.
>
> Derrick
>
> On May 10, 2009, at 10:35 PM, Ken Elkabany k...@elkabany.com wrote:
> > Is this bug fixed in the client or the server? Thanks.
> >
> > Ken
> >
> > On Sun, May 10, 2009 at 7:22 PM, Derrick Brashear sha...@gmail.com wrote:
> > > I'd venture this is a bug fixed in 1.4.10, with idle dead time
> > > computation in rx.
> > >
> > > Derrick
> > >
> > > On May 10, 2009, at 9:53 PM, Ken Elkabany k...@elkabany.com wrote:
> > > > [...]
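
For the os.makedirs case described in the message above, a script can at least distinguish a timeout from other errors and retry a few times before giving up. This is only a sketch of a workaround, written for current Python 3 rather than the interpreter available in 2009. `makedirs_with_retry` and its parameters are made-up names, and since the message says the processes had to be restarted, a retry may well not be sufficient.

```python
import errno
import os
import time


def makedirs_with_retry(path, attempts=3, delay=1.0):
    """Create path like `mkdir -p`, retrying only on ETIMEDOUT.

    Returns the number of the attempt that succeeded. Any other error,
    or a timeout on the last attempt, is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            os.makedirs(path, exist_ok=True)
            return attempt
        except OSError as exc:
            if exc.errno != errno.ETIMEDOUT or attempt == attempts:
                raise
            # Back off a little longer after each failed attempt.
            time.sleep(delay * attempt)
```

Logging the attempt number returned here would also show how often the timeout occurs, which is hard to observe when the error rate is low.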
[OpenAFS] Connection Timed Out errors occasionally when accessing openafs drive
Hello,

I have openafs 1.4.9 client and server running on two separate machines across a WAN. The client has scripts that access the /afs/our.cell/ directory. Occasionally, a script will fail to complete, and the logs will report Connection Timed Out on a mkdir -p /afs/our.cell/x/y/z command. The frequency of the errors is approximately 1 in 100, low enough that they are not easily reproducible manually, but high enough to hamper our project.

The scripts run as the root user, and are guaranteed to have the proper ticket and token. It's also important to note that these scripts often run in parallel (4 at a time, all root, modifying our cell). When one fails, all scripts running concurrently will fail with the same error, and I typically either unlog;kdestroy or restart the openafs-client (I am unsure which of those solutions is necessary or sufficient).

I will soon have an additional LAN setup, and will determine if the same error occurs. Has anyone dealt with this issue before?

Thank you for the assistance,
Ken
Re: [OpenAFS] Connection Timed Out errors occasionally when accessing openafs drive
I'd venture this is a bug fixed in 1.4.10, with idle dead time computation in rx.

Derrick

On May 10, 2009, at 9:53 PM, Ken Elkabany k...@elkabany.com wrote:
> [...]
Re: [OpenAFS] Connection Timed Out errors occasionally when accessing openafs drive
Is this bug fixed in the client or the server? Thanks.

Ken

On Sun, May 10, 2009 at 7:22 PM, Derrick Brashear sha...@gmail.com wrote:
> I'd venture this is a bug fixed in 1.4.10, with idle dead time
> computation in rx.
>
> Derrick
>
> On May 10, 2009, at 9:53 PM, Ken Elkabany k...@elkabany.com wrote:
> > [...]
Re: [OpenAFS] Connection Timed Out errors occasionally when accessing openafs drive
It probably matters in the server here, but both.

Derrick

On May 10, 2009, at 10:35 PM, Ken Elkabany k...@elkabany.com wrote:
> Is this bug fixed in the client or the server? Thanks.
>
> Ken
>
> On Sun, May 10, 2009 at 7:22 PM, Derrick Brashear sha...@gmail.com wrote:
> > I'd venture this is a bug fixed in 1.4.10, with idle dead time
> > computation in rx.
> >
> > Derrick
> >
> > On May 10, 2009, at 9:53 PM, Ken Elkabany k...@elkabany.com wrote:
> > > [...]
Re: [OpenAFS] Connection timed out?
> During this test we encounter 'Permission denied' errors, which seem
> to coincide with 'kernel: afs: failed to store file (110)' entries in
> /var/log/messages. 110=Connection timed out. The fileserver is busy
> but responsive, about 25 builds (out of 50) complete normally.

I don't know if this is a coincidence or not. I have 1.4.8 clients that
do not behave against a 1.4.2 (yeah, I know...) server:

Mar 11 13:21:18 a03c11n14 kernel: afs: Waiting for busy volume 537086116 (prj.sbc.aronh.13) in cell pdc.kth.se
Mar 11 13:21:20 a03c11n14 kernel: afs: failed to store file (network problems)
Mar 11 13:23:33 a03c11n14 last message repeated 3 times
Mar 11 13:25:26 a03c11n14 last message repeated 4 times
Mar 11 13:27:23 a03c11n14 last message repeated 4 times
Mar 11 13:29:30 a03c11n14 last message repeated 4 times
Mar 11 13:31:37 a03c11n14 last message repeated 4 times
Mar 11 13:33:39 a03c11n14 last message repeated 4 times
Mar 11 13:35:36 a03c11n14 last message repeated 4 times
Mar 11 13:37:34 a03c11n14 last message repeated 4 times
Mar 11 13:39:38 a03c11n14 last message repeated 4 times

Then silence. The console said something like:

Call Trace:
...
system_call+0x7e/0x83
do_sys_open+0x5c/0xbe
..
Kernel panic - not syncing: Fatal exception

As this is (eh, was) a parallel job, several but not all of the clients
involved crashed like this. Unfortunately, I have no way to reproduce
it. I have moved the volume to a 1.4.8 server to start with.

Harald.
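The "(110)" in the quoted kernel message is a numeric errno, and on Linux that number is ETIMEDOUT. A small sketch for translating such numbers on the machine where the log was written; the helper name describe_errno is made up for illustration, and errno numbering differs between platforms, so the lookup is only meaningful on the same kind of system as the client:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 110 is the value from the kernel log in this thread (Linux numbering).
    print(describe_errno(110))
    # The local value of ETIMEDOUT, for comparison on non-Linux systems.
    print(errno.ETIMEDOUT)
```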
[OpenAFS] Connection timed out?
L.S.,

We are evaluating OpenAFS for use with 50 clients. One of the tests is
a kernel build on 50 clients at the same time.

During this test we encounter 'Permission denied' errors, which seem to
coincide with 'kernel: afs: failed to store file (110)' entries in
/var/log/messages. 110=Connection timed out. The fileserver is busy but
responsive, about 25 builds (out of 50) complete normally.

We are running 1.4.8 client and server, kernel 2.6.18 64-bit. Currently
all server processes run on the same server.

Fileserver settings:
/usr/afs/bin/fileserver -p 128 -b 512 -l 3072 -s 3072 -vc 3072 -cb 65536 -busyat 1536 -rxpck 1024 -nojumbo

What are we doing wrong (except for the way we test ;-))?

Regards,
Robbert

-- 
Robbert Eggermont        Information Communication Theory
r.eggerm...@tudelft.nl   Electr.Eng., Mathematics Comp.Science
+31 (15) 2783234         Delft University of Technology
Re: [OpenAFS] Connection timed out?
On Tue, 10 Mar 2009, Robbert Eggermont wrote:

> L.S.,
>
> We are evaluating OpenAFS for use with 50 clients. One of the tests
> is a kernel build on 50 clients at the same time.
>
> During this test we encounter 'Permission denied' errors, which seem
> to coincide with 'kernel: afs: failed to store file (110)' entries in
> /var/log/messages. 110=Connection timed out. The fileserver is busy
> but responsive, about 25 builds (out of 50) complete normally.
>
> We are running 1.4.8 client and server, kernel 2.6.18 64-bit.
> Currently all server processes run on the same server.
>
> Fileserver settings:
> /usr/afs/bin/fileserver -p 128 -b 512 -l 3072 -s 3072 -vc 3072 -cb 65536 -busyat 1536 -rxpck 1024 -nojumbo

The number of threads seems to be more than appropriate for 50 clients.

It might be interesting to look at the output of

rxdebug server 7000

during a build, especially the top, where it tells you about waiting
calls and idle threads.

Regards
Felix
Re: [OpenAFS] Connection timed out?
Felix Frank wrote:
> The number of threads seems to be more than appropriate for 50
> clients. It might be interesting to look at the output of
> rxdebug server 7000 during a build, especially the top, where it
> tells you about waiting calls and idle threads.

The test consists of an untar, make -j2, and rm. The connection
timeouts started at about 22:05 (during the make).

rxde...@server:

2009-03-09T21:15+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2891, packet reclaims: 10968, calls: 14533306, used FDs: 20
not waiting for packets.
0 calls waiting for a thread
123 threads are idle

2009-03-09T21:20+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2496, packet reclaims: 10968, calls: 14806865, used FDs: 61
not waiting for packets.
0 calls waiting for a thread
78 threads are idle

2009-03-09T21:25+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2067, packet reclaims: 10968, calls: 15155769, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
86 threads are idle

2009-03-09T21:30+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2361, packet reclaims: 10968, calls: 15451575, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
87 threads are idle

2009-03-09T21:35+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2361, packet reclaims: 10968, calls: 15888390, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
99 threads are idle

2009-03-09T21:40+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2382, packet reclaims: 10968, calls: 16312797, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
96 threads are idle

2009-03-09T21:45+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2551, packet reclaims: 10968, calls: 17050004, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
105 threads are idle

2009-03-09T21:50+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2697, packet reclaims: 10968, calls: 17827397, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
99 threads are idle

2009-03-09T21:55+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2574, packet reclaims: 10968, calls: 18517191, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
103 threads are idle

2009-03-09T22:00+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2562, packet reclaims: 10968, calls: 19140482, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
90 threads are idle

2009-03-09T22:05+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 1466, packet reclaims: 11269, calls: 19335878, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
40 threads are idle

2009-03-09T22:10+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 1219, packet reclaims: 12979, calls: 19414589, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
43 threads are idle

2009-03-09T22:15+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2484, packet reclaims: 14897, calls: 19466551, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
84 threads are idle

upt...@server:

21:20:02 up 27 days, 4:34, 9 users, load average: 6.14, 2.46, 0.95
21:25:01 up 27 days, 4:39, 9 users, load average: 3.72, 3.92, 2.05
21:30:01 up 27 days, 4:44, 9 users, load average: 5.04, 3.94, 2.50
21:35:02 up 27 days, 4:49, 9 users, load average: 5.72, 4.82, 3.26
21:40:01 up 27 days, 4:54, 9 users, load average: 7.06, 5.53, 3.95
21:45:01 up 27 days, 4:59, 9 users, load average: 10.97, 8.74, 5.73
21:50:02 up 27 days, 5:04, 10 users, load average: 4.00, 7.05, 5.94
21:55:02 up 27 days, 5:09, 10 users, load average: 4.29, 5.32, 5.46
22:00:02 up 27 days, 5:14, 10 users, load average: 8.73, 8.09, 6.68
22:05:02 up 27 days, 5:19, 10 users, load average: 2.99, 5.27, 5.89
22:10:02 up 27 days, 5:24, 10 users, load average: 2.38, 3.75, 5.07
22:15:02 up 27 days, 5:29, 10 users, load average: 4.29, 3.44, 4.51

The first peak is during the untar, the second during the make. After
~10 clients timed out, the load went down a bit.

rxdebug localhost -rxstats -long (from this morning):

Trying 127.0.0.1 (port 7000):
Free packets: 2895, packet reclaims: 18020, calls: 22235421, used FDs: 13
not waiting for packets.
0 calls waiting for a thread
123 threads are idle
rx stats: free packets 2895, allocs 367120898, alloc-failures(rcv 0/0,send 0/0,ack 0)
   greedy 0, bogusReads 0 (last from host 0), noPackets 0, noBuffers 0, selects 0, sendSelects 0
   packets read: data 327835144 ack 33311295 busy 0 abort 3 ackall 0 challenge 1066 response 610 debug 654 params 0 unused 0 unused 0 unused 0 version 0
   other read counters: data 327835144, ack 33311295, dup 3574 spurious 0 dally 0
   packets sent: data 38234254 ack 206138290 busy 0 abort 3072 ackall 0 challenge 626 response 1066 debug 0 params 0 unused 0 unused 0 unused 0 version 0
   other send counters: ack 206138290, data 76468508 (not resends), resends 18183, pushed 0,
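Reading a series of rxdebug samples like the ones in this message by eye is tedious, so the counters could be pulled out with a short script. This is a rough sketch that assumes the timestamped layout used in this message (the timestamps were added by the poster's own sampling, they are not printed by rxdebug); the names RECORD and parse_samples are illustrative, and the two samples embedded below are copied from the message.

```python
import re

# Two consecutive samples copied from the message, around the time the
# connection timeouts started.
SAMPLE = """\
2009-03-09T22:00+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 2562, packet reclaims: 10968, calls: 19140482, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
90 threads are idle
2009-03-09T22:05+0100:
Trying 127.0.0.1 (port 7000):
Free packets: 1466, packet reclaims: 11269, calls: 19335878, used FDs: 64
not waiting for packets.
0 calls waiting for a thread
40 threads are idle
"""

# One match per sample; DOTALL lets the pattern span line breaks, so it
# works on flattened (single-line) text as well.
RECORD = re.compile(
    r"(?P<stamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}[+-]\d{4}):.*?"
    r"Free packets: (?P<free>\d+), packet reclaims: (?P<reclaims>\d+).*?"
    r"(?P<waiting>\d+) calls waiting for a thread\s+"
    r"(?P<idle>\d+) threads are idle",
    re.DOTALL,
)


def parse_samples(text):
    """Return one dict per sample, with the counters converted to int."""
    samples = []
    for match in RECORD.finditer(text):
        row = {k: int(v) for k, v in match.groupdict().items() if k != "stamp"}
        row["stamp"] = match.group("stamp")
        samples.append(row)
    return samples


if __name__ == "__main__":
    previous = None
    for row in parse_samples(SAMPLE):
        delta = 0 if previous is None else row["reclaims"] - previous["reclaims"]
        print(row["stamp"], "idle:", row["idle"], "new reclaims:", delta)
        previous = row
```

In the data above, packet reclaims stay at 10968 until 22:00 and start rising at 22:05, the same sample in which idle threads drop to 40, which matches the reported start of the timeouts. Whether the reclaims are a cause or a symptom is not something the numbers alone can settle.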
Re: [OpenAFS] Connection timed out?
Robbert Eggermont wrote:
> L.S.,
>
> We are evaluating OpenAFS for use with 50 clients. One of the tests
> is a kernel build on 50 clients at the same time.
>
> During this test we encounter 'Permission denied' errors, which seem
> to coincide with 'kernel: afs: failed to store file (110)' entries in
> /var/log/messages. 110=Connection timed out. The fileserver is busy
> but responsive, about 25 builds (out of 50) complete normally.
>
> We are running 1.4.8 client and server, kernel 2.6.18 64-bit.
> Currently all server processes run on the same server.
>
> Fileserver settings:
> /usr/afs/bin/fileserver -p 128 -b 512 -l 3072 -s 3072 -vc 3072 -cb 65536 -busyat 1536 -rxpck 1024 -nojumbo
>
> What are we doing wrong (except for the way we test ;-))?
>
> Regards,
> Robbert

My feeling is that the famous new (with 1.4.8) idleDead mechanism plays
a role here. It would be interesting to know whether the same happens
with 1.4.7 clients.

Hartmut
[OpenAFS] connection timed out after salvage completes
My fileserver seems to want to salvage every time the machine boots,
but that's another story...

It seems that if I access a volume being salvaged (OpenAFS 1.4.1 Linux
client), I get the usual "connection timed out" error... but once the
volume finishes salvaging and comes on-line (and other clients can
access it), the client that got the error continues getting the error
for several minutes.

Is this the expected behavior, or should I narrow down the problem
further and file a bug report?

  - a

-- 
PGP/GPG: 5C9F F366 C9CF 2145 E770 B1B8 EFB1 462D A146 C380
Re: [OpenAFS] connection timed out after salvage completes
Adam Megacz [EMAIL PROTECTED] writes:

> My fileserver seems to want to salvage every time the machine boots,
> but that's another story...

Make sure your system shutdown process is cleanly shutting down the
file server.

> It seems that if I access a volume being salvaged (OpenAFS 1.4.1
> Linux client), I get the usual "connection timed out" error... but
> once the volume finishes salvaging and comes on-line (and other
> clients can access it), the client that got the error continues
> getting the error for several minutes.

> Is this the expected behavior, or should I narrow down the problem
> further and file a bug report?

It's expected; when a file server is down, the cache manager will mark
the host as down and won't retry for some interval (five minutes sticks
in my head). You can force an immediate check with fs checkservers.

-- 
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
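For scripts that need to recover without waiting out the retry interval, the fs checkservers call mentioned above could be wrapped like this. A minimal sketch only: the function name force_server_check is invented here, no extra flags are passed, and the injectable run parameter exists purely so the wrapper can be tried on a machine without an AFS client.

```python
import subprocess


def force_server_check(run=subprocess.run):
    """Ask the cache manager to re-probe file servers it has marked down.

    Runs `fs checkservers` and returns whatever it printed, stripped.
    A non-zero exit status is not treated as an error here; callers that
    care should inspect the output themselves.
    """
    result = run(
        ["fs", "checkservers"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(force_server_check())
```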
[OpenAFS] Connection timed out
Hello,

I use OpenAFS 1.4 with MIT Kerberos. I can successfully acquire a
ticket and run aklog, but I got the error

fs:'/afs': Connection timed out

when I tried to run

fs setacl /afs system:anyuser rl

Can anyone help?

Thanks,
Amir Saad
Software Engineer
Re: [OpenAFS] Connection timed out
On Mon, 23 Jan 2006, Amir Saad wrote:

> Hello,
>
> I use OpenAFS 1.4 with MIT Kerberos. I can successfully acquire a
> ticket and run aklog, but I got the error
> fs:'/afs': Connection timed out
> when I tried to run
> fs setacl /afs system:anyuser rl
>
> Can anyone help?

Just a guess, but... Turn off dynroot, or stop trying to set an ACL on
a fake directory. You shouldn't need to anyway.

Derrick
Re: [OpenAFS] Connection timed out?
We are having a similar problem on some of our machines. It seems some
of our machines time out on file transfers, but lookup access seems
fine.

--Mike

John Koyle wrote:
> I have about 6 volumes on a server and have a separate server that
> has readonly replicas of those volumes. Call them a, a.b, a.c, a.d,
> etc. I can access all volumes just fine from two different clients;
> however, one volume, a.c, keeps getting a "connection timed out"
> error on the clients. It happens at roughly the same time on both
> client systems, but does not happen with any other volumes - I can
> access them just fine.
>
> Running fs checkv clears up the problem for a while, but several
> hours later (6-8), the problem crops up again.
>
> I've tried doing a backup of the volume, deleting it from the
> servers, then restoring, and it still happens.
>
> Does anyone have any ideas for this? Running v1.2.8 on solaris9
> servers and RH linux 7.x clients.
>
> Thanks!
> John
Re: [OpenAFS] Connection timed out
>>>>> "Torbjorn" == Torbjorn Pettersson [EMAIL PROTECTED] writes:

    >> It seems that there is no space on the device (No space left
    >> on device), but why would the client stop responding because
    >> of this?

    Torbjorn> You checked so you don't run out of diskspace on the
    Torbjorn> cache?

It did, but why did that force a restart of the client? Seems kinda
dumb, doesn't it?

    Torbjorn> I'm using the debian testing openafs packages,
    Torbjorn> v1.2.3final2-3, with a kerberos 5 server, on amd
    Torbjorn> CPUs...

Me too (Debian and all), just recompiled for my 'semi-potato' box...

-- 
Serbian Legion of Doom $400 million in gold bullion attack Albanian
ammunition FBI Peking nitrate ammonium Mossad FSF KGB Waco, Texas Semtex
[See http://www.aclu.org/echelonwatch/index.html for more about this]
Re: [OpenAFS] Connection timed out
Turbo Fredriksson [EMAIL PROTECTED] writes:

> >>>>> "Torbjorn" == Torbjorn Pettersson [EMAIL PROTECTED] writes:
>
>     >> It seems that there is no space on the device (No space left
>     >> on device), but why would the client stop responding because
>     >> of this?
>
>     Torbjorn> You checked so you don't run out of diskspace on the
>     Torbjorn> cache?
>
> It did, but why did that force a restart of the client? Seems kinda
> dumb, doesn't it?

I seem to remember that there are no actual consistency checks on the
cache, so I think you enter a kind of undefined state when you trash
it...

I would recommend that you adjust your cachesize settings to make sure
that it doesn't happen again. Also, having a separate partition for the
cache is a good thing(tm).

>     Torbjorn> I'm using the debian testing openafs packages,
>     Torbjorn> v1.2.3final2-3, with a kerberos 5 server, on amd
>     Torbjorn> CPUs...
>
> Me too (Debian and all), just recompiled for my 'semi-potato' box...

//Tobbe

-- 
Torbjörn Pettersson          Email [EMAIL PROTECTED]
Vattugatan 5                 Web www.strul.nu/~tobbe
S-111 52 Stockholm, Sweden
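To catch the "No space left on device" condition on the cache before the client ends up in that undefined state, the free space on the cache filesystem could be monitored. A minimal sketch, assuming the cache lives in a directory the script can stat; the helper name cache_headroom and the 10% threshold are arbitrary illustrations, and the cache path in the usage part is an assumption that has to be replaced with the directory configured for the local client.

```python
import shutil


def cache_headroom(cache_dir, min_free_fraction=0.10):
    """Return (free_fraction, ok) for the filesystem holding cache_dir.

    min_free_fraction is an arbitrary illustrative threshold, not an
    OpenAFS recommendation.
    """
    usage = shutil.disk_usage(cache_dir)
    free_fraction = usage.free / usage.total
    return free_fraction, free_fraction >= min_free_fraction


if __name__ == "__main__":
    # Assumed location; use the cache directory configured on your client.
    fraction, ok = cache_headroom("/var/cache/openafs")
    print(f"free: {fraction:.1%}", "ok" if ok else "LOW")
```

With the cache on its own partition, as suggested above, this number reflects only the cache and cannot be eaten by other data on the same filesystem.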