Re: [OpenAFS] AFS lag

2009-05-27 Thread Ken Hornstein
I'm no ubik engineer, but as far as I understand it, the protocol
was not designed for even numbers of participating servers. For best
results, three or five servers seem to be optimum.

I hear this frequently, and don't see why it should be true.  The tie
breaking mechanism during an election is simple.

The tie breaking mechanism isn't really the issue here.

My point is that you gain almost no benefit from an even number of
servers.  Specifically, if you have four ubik servers, you have the
same amount of redundancy as if you have three servers(*); you can lose
one and still maintain quorum.

(*) Okay, purists will point out that this is not exactly true.  If you
have four servers and you happen to lose two, AND one of the two remaining
ones is the best server, quorum can still be established.
I could see other reasons for having an even number of servers, but people
should understand exactly what sort of redundancy they can expect out
of a given Ubik configuration.
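
The arithmetic behind this can be sketched in a few lines of Python. This is
only an illustration of the strict-majority rule; the function name
guaranteed_failures is made up for the example, and it deliberately ignores
the best-server exception from the footnote.

```python
def guaranteed_failures(n_servers: int) -> int:
    """Number of servers that can fail, no matter which ones, while a
    strict majority of the configured servers is still up."""
    majority = n_servers // 2 + 1      # smallest strict majority
    return n_servers - majority


for n in range(1, 7):
    print(f"{n} servers: can lose any {guaranteed_failures(n)}")
```

Three and four servers both come out at one tolerated failure, and it takes
five servers to tolerate two.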

--Ken
___
OpenAFS-info mailing list
OpenAFS-info@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-info


Re: [OpenAFS] AFS lag

2009-05-26 Thread Kim Kimball

Derrick Brashear wrote:

  On Wed, Mar 18, 2009 at 9:56 PM, Ken Hornstein k...@cmf.nrl.navy.mil wrote:

  I'm no ubik engineer, but as far as I understand it, the protocol was not
  designed for even numbers of participating servers. For best results, three
  or five servers seem to be optimum.

I hear this frequently, and don't see why it should be true. The tie
breaking mechanism during an election is simple.

Kim

There is a lot of misinformation about Ubik out there; the voting
protocol is actually not complicated, it's just not documented well.

it's actually well-documented, if you find Kazar's paper on Quorum Completion.

If your database servers are accessible via the Internet, we could take
a look at them via udebug. Really, there are only a few things that can
go wrong; of all of the pieces of AFS, I think Ubik is one of the most
bulletproof.

There are a couple (unlikely) open issues; See RT.



Re: [OpenAFS] AFS lag

2009-03-19 Thread Ken Hornstein
 There is a lot of misinformation about Ubik out there; the voting
 protocol is actually not complicated, it's just not documented well.

it's actually well-documented, if you find Kazar's paper on Quorum Completion.

You know, we should try to find a copy of that and put it somewhere useful.
From what I remember (I think I saw a copy once), the paper gets you about
80% of the way there; the source code gets you the rest of the way.

Actually, I now realize that I _do_ have a copy of it.  Can we put it on
the OpenAFS web site?  I just have the PostScript; it's easy enough to
convert that to PDF.

 If your database servers are accessible via the Internet, we could take
 a look at them via udebug.  Really, there are only a few things that can
 go wrong; of all of the pieces of AFS, I think Ubik is one of the most
 bulletproof.

There are a couple (unlikely) open issues; See RT.

Didn't know about those.  Still, I think we need more information to diagnose
the original problem.

--Ken


[OpenAFS] AFS lag

2009-03-18 Thread Abdelkader El mastour
Configuration:
NetBSD 4
Heimdal 1.1
Arla
OpenAFS 1.4.5 via pkgsrc
replicated root.afs and root.cell (RO)
1000 users per server

10 servers for fileserver.

2 servers for vlserver and ptserver.

Our users have been experiencing major lag when accessing AFS.
It all began when we had a hardware problem with one of our AFS servers
(afs-1): accessing AFS was laggy for every user on that server, so we
decided to move all of them from this server to one of the nine others.
We shut down the broken server, took it off the listaddrs list, and
restarted the vlserver instance.
The slowdown continued.

We turned the afs-1 server on again, but without launching any AFS services,
and then there were no more lags accessing AFS.
Since then we have had to shut down afs-1 and take it off the listaddrs
list, and the lags are back.
Note #1: the AFS servers have been up for a year and we have never
experienced any issue before.
Note #2: bos status and sysstat don't reveal any issue.
Any guess about the reasons for the lags?

-- 
Abdelkader El mastour
0620477723


Re: [OpenAFS] AFS lag

2009-03-18 Thread Felix Frank

On Wed, 18 Mar 2009, Abdelkader El mastour wrote:


Configuration:
NetBSD 4
Heimdal 1.1
Arla


You have Arla clients?


OpenAFS 1.4.5 via pkgsrc
replicated root.afs and root.cell (RO)
1000 users per server

10 servers for fileserver.

2 servers for vlserver and ptserver.


This is not good. I've recently run some tests with 2 DB-servers, and
operation is not optimal. It can take them longer than necessary to 
determine the sync site. 3 servers is pretty much ideal, but even a single 
server works smoother than 2 IMHO.



Our users have been experiencing major lag when accessing AFS.
It all began when we had a hardware problem with one of our AFS servers
(afs-1): accessing AFS was laggy for every user on that server, so we
decided to move all of them from this server to one of the nine others.
We shut down the broken server, took it off the listaddrs list, and
restarted the vlserver instance.
The slowdown continued.

We turned the afs-1 server on again, but without launching any AFS services,
and then there were no more lags accessing AFS.
Since then we have had to shut down afs-1 and take it off the listaddrs
list, and the lags are back.
Note #1: the AFS servers have been up for a year and we have never
experienced any issue before.
Note #2: bos status and sysstat don't reveal any issue.
Any guess about the reasons for the lags?


I presume afs-1 was NOT one of your DB servers. If it is, 
CellServDB would be the place to start.


There may be problems with replicated volumes. root.cell should be cached at
all times (are there frequent vos releases?) but who knows...

On afflicted clients, try fs checkv.

HTH
Felix


Re: [OpenAFS] AFS lag

2009-03-18 Thread Abdelkader El mastour
On Wed, Mar 18, 2009 at 2:54 PM, Derrick Brashear sha...@gmail.com wrote:

 On Wed, Mar 18, 2009 at 5:35 AM, Abdelkader El mastour
 a.elmast...@gmail.com wrote:
  Configuration:
  NetBSD 4
  Heimdal 1.1
  Arla
  OpenAFS 1.4.5 via pkgsrc
  replicated root.afs and root.cell (RO)
  1000 users per server
 
  10 servers for fileserver.

 what's the configuration of the fileservers?

 bos status (any fileserver host) fs -long

 and share the information?


FileLog:
http://perso.epitech.eu/~el-mas_a/Filelog/FileLog

 are there any multihomed machines involved, or NATs?

What do you mean by multihomed machines?

-- 
Abdelkader El mastour
0620477723


Re: [OpenAFS] AFS lag

2009-03-18 Thread Abdelkader El mastour
On Wed, Mar 18, 2009 at 5:30 PM, Pesce, Nicholas npe...@qualcomm.com wrote:

 We just experienced significant lag at our AFS site with vos exam and
 vos release.  This seemed to be caused by a bug with Ubik callbacks
 (version 1.4.7).  One of our database servers was restarted, and then the
 database servers did not sync properly with the sync site (only the sync
 site was working).  I got all but one of the vlservers to run, but until I
 got all 6 servers functioning properly (after patching) we still saw this
 issue.

 Have you checked udebug to ensure that all of your database server
 processes are current, up and giving a beacon?


 I agree with Abdelkader and would recommend having at least 3 database
 servers.  You could be walking on very thin ice with just 2.

 Sincerely,

 --
 Nicholas Pesce
 npe...@qualcomm.com


 -Original Message-
 From: openafs-info-ad...@openafs.org [mailto:
 openafs-info-ad...@openafs.org] On Behalf Of Felix Frank
 Sent: Wednesday, March 18, 2009 4:15 AM
 To: Abdelkader El mastour
 Cc: openafs-info@openafs.org
 Subject: Re: [OpenAFS] AFS lag

 On Wed, 18 Mar 2009, Abdelkader El mastour wrote:

  Configuration:
  NetBSD 4
  Heimdal 1.1
  Arla

 You have Arla clients?

  OpenAFS 1.4.5 via pkgsrc
  replicated root.afs and root.cell (RO)
  1000 users per server
 
  10 servers for fileserver.
 
  2 servers for vlserver and ptserver.

 This is not good. I've recently run some tests with 2 DB-servers, and
 operation is not optimal. It can take them longer than necessary to
 determine the sync site. 3 servers is pretty much ideal, but even a single
 server works smoother than 2 IMHO.

  Our users have been experiencing major lag when accessing AFS.
  It all began when we had a hardware problem with one of our AFS servers
  (afs-1): accessing AFS was laggy for every user on that server, so we
  decided to move all of them from this server to one of the nine others.
  We shut down the broken server, took it off the listaddrs list, and
  restarted the vlserver instance.
  The slowdown continued.
 
  We turned the afs-1 server on again, but without launching any AFS
  services, and then there were no more lags accessing AFS.
  Since then we have had to shut down afs-1 and take it off the listaddrs
  list, and the lags are back.
  Note #1: the AFS servers have been up for a year and we have never
  experienced any issue before.
  Note #2: bos status and sysstat don't reveal any issue.
  Any guess about the reasons for the lags?

 I presume afs-1 was NOT one of your DB servers. If it is,
 CellServDB would be the place to start.

 There may be problems with replicated volumes. root.cell should be cached
 at all times (are there frequent vos releases?) but who knows...

 On afflicted clients, try fs checkv.

 HTH
 Felix




I agree with Abdelkader and would recommend having at least 3 database
servers.  You could be walking on very thin ice with just 2.
What's the reason for this?

-- 
Abdelkader El mastour
0620477723


Re: [OpenAFS] AFS lag

2009-03-18 Thread Felix Frank

I agree with Abdelkader and would recommend having at least 3 database
servers.  You could be walking on very thin ice with just 2.
What's the reason for this?


I'm no ubik engineer, but as far as I understand it, the protocol was not
designed for even numbers of participating servers. For best results, three
or five servers seem to be optimum.

What I definitely witnessed is that servers in a cell configured with two
servers take more than a minute to elect a sync site after server restarts.
Three servers are supposed to make it in an instant.

Apart from that, my test cell runs two servers and it works just fine, so long
as no DB server restarts are necessary. It's plain annoying when I do
development on a DB service. There may be more pitfalls in 2-server setups
that I'm unaware of.

Regards
Felix


Re: [OpenAFS] AFS lag

2009-03-18 Thread Ken Hornstein
I'm no ubik engineer, but as far as I understand it, the protocol was not
designed for even numbers of participating servers. For best results, three
or five servers seem to be optimum.

There is a lot of misinformation about Ubik out there; the voting
protocol is actually not complicated, it's just not documented well.
Looking at the source code is even more confusing.

So, let me clear up some misconceptions:

- It's not really that an odd number is optimum; it's just that you're wasting
  a server with an even number.

  Why?  Well, the Ubik voting requires a majority number of servers to win an
  election; if there are 4 servers and only two are available, then that's
  not a majority.  So with 4 servers, you can lose only one server and still
  maintain quorum (same as with three).  You need 5 servers to be able to
  lose two of them.

  Now, there is an extra wrinkle here ... the best server (lowest
  numbered) gets an extra vote.  So in a 4 server configuration,
  you can actually lose two and maintain quorum ... as long as one
  of the two isn't the best server.  But with five servers, you
  can lose ANY two.  But the protocol works fine with two, or three, or
  four, or five.  There is NO magic here.
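
As a rough sketch of the rule just described, the following Python models
each YES vote as two points and gives the lowest-numbered server one extra
point, so that its extra weight only matters as a tie-breaker. That weighting
is an assumption chosen to match the examples above, not a transcription of
the Ubik source; has_quorum, survivable_losses and the server numbering are
hypothetical.

```python
from itertools import combinations


def has_quorum(n_servers: int, up: set) -> bool:
    """Servers are numbered 0..n_servers-1; server 0 is the 'best' one.

    Assumed weighting: 2 points per live server, plus 1 tie-breaking
    point if the best server is among them. Quorum needs more than
    n_servers points, i.e. more than half of the weighted votes.
    """
    points = 2 * len(up) + (1 if 0 in up else 0)
    return points > n_servers


def survivable_losses(n_servers: int, lost: int) -> list:
    """Every set of `lost` servers whose failure still leaves a quorum."""
    servers = range(n_servers)
    return [
        down
        for down in combinations(servers, lost)
        if has_quorum(n_servers, set(servers) - set(down))
    ]


# 4 servers, lose 2: only the combinations that keep server 0 survive.
print(survivable_losses(4, 2))
# 5 servers, lose 2: count how many of the 10 combinations survive.
print(len(survivable_losses(5, 2)))
```

Under this model a two-server cell has quorum only while server 0 is up,
which agrees with the later remark that only the best server can be elected
in a two-server cell.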

What I definitely witnessed is that servers in a cell configured with two
servers take more than a minute to elect a sync site after server restarts.
Three servers are supposed to make it in an instant.

This is one of those mostly-not-true statements that has a bit of truth in
it.  The exact details:

- When brought up, a database server will not vote YES for anyone for 75
  seconds.  This is inviolate.  It doesn't matter if there are two,
  three, or 100 database servers.  If you bring up all your servers
  cold, at the same time, it will take at least 75 seconds for a
  quorum election.

- If you have two database servers and you only restart the best server
  (note: in a two database server cell, only the best server can
  ever be elected as master), a new election will take 75 seconds.
  Why?  Because you have to wait for the best server to be able to
  vote for itself; without that vote, there is not a majority.
 
- If you have three (or more) database servers and you only restart the
  current master, a successful election will happen almost instantly.
  Why?  Because all of the servers that are still up will still vote
  YES for the master; the master's own YES vote is not necessary.  But
  note this only applies if all of the other servers are still running.
  If, for example, you rebooted the master and if it took longer than
  75 seconds for the master to restart, then what will likely happen is
  a new master will be elected.
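
The three cases above can be put into a toy timing model. The sketch below
assumes that a server may vote YES only once it has been up for 75 seconds,
that votes are weighted as two points each plus one tie-breaking point for
the best server (an assumption consistent with the examples in this message,
not taken from the source code), and that beacon intervals and network delays
are ignored. It also does not model a new master being elected while the old
one is still down. All names are invented for the illustration.

```python
NO_VOTE_SECONDS = 75.0  # per the message: no YES votes for 75 s after startup


def wait_for_quorum(uptimes: list) -> float:
    """uptimes[i] is how long server i has been up, in seconds; server 0
    is the 'best' server. Returns how long to wait until the servers that
    are allowed to vote YES add up to a quorum."""
    n = len(uptimes)
    # Time left before each server is allowed to vote YES.
    remaining = [max(0.0, NO_VOTE_SECONDS - u) for u in uptimes]
    for wait in sorted(set(remaining)):
        voters = [i for i, r in enumerate(remaining) if r <= wait]
        points = 2 * len(voters) + (1 if 0 in voters else 0)
        if points > n:
            return wait
    raise ValueError("no servers configured")


# All three servers started cold at the same time.
print(wait_for_quorum([0.0, 0.0, 0.0]))        # 75.0
# Two servers, best server just restarted: its own vote is required.
print(wait_for_quorum([0.0, 3600.0]))          # 75.0
# Three servers, one just restarted: the other two already suffice.
print(wait_for_quorum([0.0, 3600.0, 3600.0]))  # 0.0
```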

Getting back to the original poster's question ... by far the most common
problem I have seen with Ubik is bad time synchronization.  All of your
database servers must be synched up time-wise (the protocol depends on
timestamps).  It doesn't need to be femtosecond accuracy; the protocol
defines MAXSKEW as 10 seconds.
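
A quick sanity check along these lines might look like the sketch below. It
assumes you have already collected a clock reading from each database server
by some means (an NTP query, ssh, or similar); the host names and readings
are invented, and only the 10-second MAXSKEW figure comes from the text
above.

```python
MAXSKEW = 10  # seconds, as quoted above


def skewed_pairs(clocks: dict) -> list:
    """clocks maps host name -> that host's clock reading in seconds
    (e.g. Unix time), all sampled at the same moment. Returns the pairs
    of hosts whose clocks differ by more than MAXSKEW."""
    hosts = sorted(clocks)
    return [
        (a, b, abs(clocks[a] - clocks[b]))
        for i, a in enumerate(hosts)
        for b in hosts[i + 1:]
        if abs(clocks[a] - clocks[b]) > MAXSKEW
    ]


# Hypothetical readings: db2 is 14 seconds ahead of db1.
readings = {"db1": 1237400000, "db2": 1237400014, "db3": 1237400003}
for a, b, skew in skewed_pairs(readings):
    print(f"{a} and {b} differ by {skew} s (limit {MAXSKEW} s)")
```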

If your database servers are accessible via the Internet, we could take
a look at them via udebug.  Really, there are only a few things that can
go wrong; of all of the pieces of AFS, I think Ubik is one of the most
bulletproof.
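
For anyone who wants to run the same check locally, the sketch below builds
one udebug command line per database server and service. It assumes the
usual "udebug -server <host> -port <port>" syntax and the usual ports (7003
for vlserver, 7002 for ptserver); the host names are placeholders, and
nothing is executed unless you pass the lists to something like
subprocess.run.

```python
# Usual Ubik service ports; adjust if your cell differs.
UBIK_PORTS = {"vlserver": 7003, "ptserver": 7002}


def udebug_commands(db_servers: list) -> list:
    """Return one argv list per (server, service) pair."""
    return [
        ["udebug", "-server", host, "-port", str(port), "-long"]
        for host in db_servers
        for port in UBIK_PORTS.values()
    ]


# Placeholder host names for a two-server cell.
for argv in udebug_commands(["afsdb1.example.org", "afsdb2.example.org"]):
    print(" ".join(argv))
```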

--Ken


Re: [OpenAFS] AFS lag

2009-03-18 Thread Derrick Brashear
On Wed, Mar 18, 2009 at 9:56 PM, Ken Hornstein k...@cmf.nrl.navy.mil wrote:
I'm no ubik engineer, but as far as I understand it, the protocol was not
designed for even numbers of participating servers. For best results, three
or five servers seem to be optimum.

 There is a lot of misinformation about Ubik out there; the voting
 protocol is actually not complicated, it's just not documented well.

it's actually well-documented, if you find Kazar's paper on Quorum Completion.

 If your database servers are accessible via the Internet, we could take
 a look at them via udebug.  Really, there are only a few things that can
 go wrong; of all of the pieces of AFS, I think Ubik is one of the most
 bulletproof.

There are a couple (unlikely) open issues; See RT.


-- 
Derrick