Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails

2015-03-07 Thread thegeezer
On 05/03/15 09:46, Marc Joliet wrote:
> Hi all,
>
> at work I'm (well, *we* are) facing an interesting problem.  Since we are sort
> of stabbing in the dark here, I thought I'd ask here.  Also, since this is from
> work, I will not be able to divulge very many details (not to mention that as a
> student worker I simply don't *know* many details).  However, I do have
> permission from my boss to ask about this in an anonymised fashion.
>
> The symptom we're seeing is that the NIC goes down and DHCP packets stop getting
> through after a certain amount of time.  What happens is:
>
> 1.) The NIC is brought up (some built-in Intel model).
>
> 2.) A DHCP client configures it.
>
> 3.) The network connection is lost at some point (the amount of time this takes
> varies, but it can be as little as 20 minutes).
>
> 4.) Eventually the lease runs out and the DHCP client tries to renew it, but
> gets no response.  Sometimes, after many hours (at least 6), it will get a
> DHCPACK, but that's it.  One of our sysadmins says that not only does
> the DHCP server never see the packets, but the managed switch that the PC
> is directly attached to *also* never does (again, except for when the
> occasional DHCPACK comes).
>
> 5.) Restart the network device.  A reboot is not required, but it is necessary
> to terminate the DHCP client.  After that everything works again.
>
> 6.) GOTO 3.
>
> (Note that I have observed that steps 3 and 4 do not necessarily occur in
> order.)
>
> This has been rather baffling, since this problem is limited to 3 computers.
>
> One of them (the longest running) runs Gentoo, courtesy of me.  This is the
> first one we saw the problem with.  Since we couldn't figure it out (switching
> from dhcpcd to dhclient, turning off the firewall, monitoring with tcpdump,
> etc., all with help from one of our sysadmins; Google, too, of course), Gentoo
> was "blamed", so we got a replacement PC with Fedora 20 on it, which *also*
> showed this behaviour.  Both PCs run some special software (some of it mine).
> Thus, at some point this software was "blamed".
>
> So we started experimenting: we configured the Fedora PC to *not* start the
> special software, and have not seen any problems all week.  Yesterday afternoon
> I then started *one* of the programs, and had not seen any problems yet by the
> time I went home.
>
> So that would speak *for* that theory, right? Well, for comparison, my boss
> recently started running a separate PC, also with a bog-standard Fedora 20.
> Guess what: it *also* shows the *exact* same behaviour as the other two PCs
> ("journalctl -u NetworkManager" shows pages upon pages of unanswered
> DHCPREQUESTs, with the occasional response thrown in). Note here that this PC
> is on a different switch and in a different VLAN.
>
> The choice of Fedora comes from the fact that we use a Fedora based distro
> internally, so it is "known".  PCs running it have *not* shown the behaviour
> above (AFAIK not even *once*).  Thus, one of the few things I can think of is
> finding out what is different about them relative to the standard Fedora.
>
> Right now my main ideas on what the culprit could be are:
>
> - The computers' kernel/network device is improperly configured.  That is,
>   maybe special configuration is needed for the computers to work properly as
>   clients in the network.  I'm thinking of support for some (from my
>   perspective) obscure protocol(s).
>
> - It's a network problem.  The three computers are in two different VLANs,
>   while the workplace computers running the internal Fedora based distro are in
>   a third (the main network that all the normal Windows and Linux workstations
>   are connected to).  However, they are on the same switch as the two computers
>   running my software.  One argument against this is that the Windows PC that
>   runs on the same VLAN does *not* have any problems like this.
>
> One of the other ideas I had was faulty power management, and I did read of
> problems of the sort regarding the exact same network card that is in the old
> Gentoo machine on an HP support forum (from around 2008).  However, the local
> sysadmin said that they have had nothing but good experience with those network
> cards. Also: *three* computers with NIC power management problems?  That sounds
> a bit far-fetched to me.  Nevertheless, I am not fully discounting the
> possibility.
>
> You can imagine how confusing and frustrating this is.
>
> So, has anybody here ever experienced something like this? Any ideas on what
> could be the cause?
>
> Greetings

Howdy,
I've seen this before, but not with the NIC-down event.
The problem was old managed Alcatel switches combined with questionable
wiring.
In my case it was reversed: the Gentoo box was providing the DHCP, but
then suddenly nothing got DHCP responses.
Power cycling the switch was a temporary fix.
Updating the switch firmware helped a lot: it went from a daily occurrence
to a weekly one.

Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND)

2015-03-06 Thread Daniel Frey
On 03/06/2015 11:57 AM, Marc Joliet wrote:
>> I wasn't aware you had e1000e hardware - those are about as reliable as
>> they come. I've used many of them and never had the slightest trouble at
>> all. By all means study up on firmware and driver options - if you don't
>> know much about that area it's very illuminating to find out more. But
>> based on experience I'd say the chances of finding an oddity with e1000e
>> are slim, and I'd be looking at a misconfigured switch.
> 
> That's pretty much what the sysadmin said, too, when I asked what he thought
> of the "power management issue" idea.
> 
>> There are some strange switches out there that let you make crazy
>> configurations, e.g. blanket-drop all broadcast traffic on one or more
>> ports. That's where I'd be looking first.
> 
> Yeah, that agrees with my instinct that it's most likely something to do with
> the switch.
> 

Is the DHCP server virtualized using VMware? I've come across a very
strange issue where ESXi's e1000e driver was very buggy and caused random
disconnects of the virtual machine. This is strictly server side,
however, nothing to do with the client and/or switch.

I suspect that you probably aren't using ESXi, but figured I'd mention
it anyway. This happened (in my experience) with both Windows and Linux
guests on ESXi, and the only way to get around it was to use some other
driver for the virtual machines (like VMware's vmxnet3 driver).

Dan



Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND)

2015-03-06 Thread Marc Joliet
On Fri, 06 Mar 2015 21:35:45 +0200, Alan McKinnon wrote:

> On 06/03/2015 20:45, Marc Joliet wrote:
> > First of all, thanks to everybody who responded so far.
> > 
> > I wanted to preface my reply to Alan by mentioning that the local sysadmin made
> > changes to the DHCP server that appear to have worked around whatever the issue
> > is.
> > 
> > I don't fully understand the error analysis (something to do with the DHCP
> > client reaching a particular state and sending DHCP packets that something
> > in-between it and the DHCP server doesn't like and that might result in vendor
> > dependent behaviour), but what the DHCP server now does is tell the client to
> > use the broadcast address as the DHCP server address (which is weird, because
> > the DHCP clients always switch to the broadcast address after a timeout, but of
> > course I'm no DHCP expert).  The affected PCs have been working normally all
> > day today.
> 
> In light of what you say below:
> 
> 
> I'd be interested to hear what your sysadmin has to say; DHCP is one of
> those things that JustWork(tm) - it uses regular UDP and there is nothing
> funny about it at all. The only thing normally between your NIC and the DHCP
> server is a switch, so that's what I'd be looking at.

That's also why I was confused about the whole thing and why I originally
thought that it was either a power management issue or some sort of network
problem.

I'll see if I can ask when I'm there again next week.

[...]
> I wasn't aware you had e1000e hardware - those are about as reliable as
> they come. I've used many of them and never had the slightest trouble at
> all. By all means study up on firmware and driver options - if you don't
> know much about that area it's very illuminating to find out more. But
> based on experience I'd say the chances of finding an oddity with e1000e
> are slim, and I'd be looking at a misconfigured switch.

That's pretty much what the sysadmin said, too, when I asked what he thought
of the "power management issue" idea.

> There are some strange switches out there that let you make crazy
> configurations, e.g. blanket-drop all broadcast traffic on one or more
> ports. That's where I'd be looking first.

Yeah, that agrees with my instinct that it's most likely something to do with
the switch.

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup




Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND)

2015-03-06 Thread Alan McKinnon
On 06/03/2015 20:45, Marc Joliet wrote:
> First of all, thanks to everybody who responded so far.
> 
> I wanted to preface my reply to Alan by mentioning that the local sysadmin made
> changes to the DHCP server that appear to have worked around whatever the issue
> is.
> 
> I don't fully understand the error analysis (something to do with the DHCP
> client reaching a particular state and sending DHCP packets that something
> in-between it and the DHCP server doesn't like and that might result in vendor
> dependent behaviour), but what the DHCP server now does is tell the client to
> use the broadcast address as the DHCP server address (which is weird, because
> the DHCP clients always switch to the broadcast address after a timeout, but of
> course I'm no DHCP expert).  The affected PCs have been working normally all
> day today.

In light of what you say below:


I'd be interested to hear what your sysadmin has to say; DHCP is one of
those things that JustWork(tm) - it uses regular UDP and there is nothing
funny about it at all. The only thing normally between your NIC and the DHCP
server is a switch, so that's what I'd be looking at.




> 
> So the current resolution is "it works", but we still don't understand (or at
> least me and my boss don't) what the underlying issue is.  Hence I'm still
> curious what people who know these technologies better than me think.
> 
> Also, I suppose it was confusing to say that the switch never saw the packets.
> The way this was determined was by post-mortem log inspection; AFAIK we didn't
> do any live inspection on the switch.  Based on the workaround, the conclusion
> we came to is that the switch must have dropped the packets (for whatever
> reason) without logging that it did.
> 
> On Fri, 6 Mar 2015 08:01:44 +0200, Alan McKinnon wrote:
> 
> [...]
>> I've seen similar things many times myself (but never on Intel network
>> kit so far)
>>
>> A lot of reading and Googling usually leads to the solution:
>>
>> - firmware upgrade for the hardware
> 
> OK, I can look into that.
> 
>> - use the correct driver (this is often non-obvious)
>> - try the in-kernel driver vs any out-of-tree vendor driver
> 
> All PCs run with the e1000e in-kernel module.  I think the Fedora systems run
> 3.18.7, so it's about as current as it can be, too.  Could it really be that the
> kernel selects the wrong driver?
> 
>> - apply driver parameters designed to work around buggy hardware (this
>>   often involves much reading)
> 
> I will also consider that.  I see that the kernel sources contain
> documentation for the e1000e driver that I can look at.

I wasn't aware you had e1000e hardware - those are about as reliable as
they come. I've used many of them and never had the slightest trouble at
all. By all means study up on firmware and driver options - if you don't
know much about that area it's very illuminating to find out more. But
based on experience I'd say the chances of finding an oddity with e1000e
are slim, and I'd be looking at a misconfigured switch.

There are some strange switches out there that let you make crazy
configurations, e.g. blanket-drop all broadcast traffic on one or more
ports. That's where I'd be looking first.


-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails (WORKED AROUND)

2015-03-06 Thread Marc Joliet
First of all, thanks to everybody who responded so far.

I wanted to preface my reply to Alan by mentioning that the local sysadmin made
changes to the DHCP server that appear to have worked around whatever the issue
is.

I don't fully understand the error analysis (something to do with the DHCP
client reaching a particular state and sending DHCP packets that something
in-between it and the DHCP server doesn't like and that might result in vendor
dependent behaviour), but what the DHCP server now does is tell the client to
use the broadcast address as the DHCP server address (which is weird, because
the DHCP clients always switch to the broadcast address after a timeout, but of
course I'm no DHCP expert).  The affected PCs have been working normally all
day today.
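
One possible reading of the "particular state" remark, sketched here purely as
an assumption and not as the sysadmin's actual analysis: under RFC 2131 a
client first renews by *unicast* to the server (RENEWING, from T1 = 0.5 x
lease by default) and only falls back to *broadcast* later (REBINDING, from
T2 = 0.875 x lease).  If something on the path drops the unicast renewals,
the client would stay silent from the server's point of view until T2.  The
lease length and server address below are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass
class LeaseTimers:
    t1: float      # seconds after lease start: RENEWING begins (unicast)
    t2: float      # seconds after lease start: REBINDING begins (broadcast)
    expiry: float  # seconds after lease start: lease is gone


def lease_timers(lease_seconds: float) -> LeaseTimers:
    """RFC 2131 default timers: T1 = 0.5 * lease, T2 = 0.875 * lease."""
    return LeaseTimers(0.5 * lease_seconds, 0.875 * lease_seconds,
                       float(lease_seconds))


def next_message(elapsed: float, timers: LeaseTimers, server_ip: str) -> str:
    """Describe what a client would send at a given lease age (seconds)."""
    if elapsed < timers.t1:
        return "nothing (BOUND)"
    if elapsed < timers.t2:
        return f"DHCPREQUEST unicast to {server_ip} (RENEWING)"
    if elapsed < timers.expiry:
        return "DHCPREQUEST broadcast to 255.255.255.255 (REBINDING)"
    return "DHCPDISCOVER broadcast (lease expired, back to INIT)"


if __name__ == "__main__":
    timers = lease_timers(3600)  # hypothetical one-hour lease
    for minute in (10, 35, 55, 61):
        print(minute, "min:", next_message(minute * 60, timers, "192.0.2.1"))
```

Seen this way, telling the client to treat the broadcast address as the server
address would simply make the RENEWING requests go out as broadcasts too.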

So the current resolution is "it works", but we still don't understand (or at
least me and my boss don't) what the underlying issue is.  Hence I'm still
curious what people who know these technologies better than me think.

Also, I suppose it was confusing to say that the switch never saw the packets.
The way this was determined was by post-mortem log inspection; AFAIK we didn't
do any live inspection on the switch.  Based on the workaround, the conclusion
we came to is that the switch must have dropped the packets (for whatever
reason) without logging that it did.

On Fri, 6 Mar 2015 08:01:44 +0200, Alan McKinnon wrote:

[...]
> I've seen similar things many times myself (but never on Intel network
> kit so far)
> 
> A lot of reading and Googling usually leads to the solution:
> 
> - firmware upgrade for the hardware

OK, I can look into that.

> - use the correct driver (this is often non-obvious)
> - try the in-kernel driver vs any out-of-tree vendor driver

All PCs run with the e1000e in-kernel module.  I think the Fedora systems run
3.18.7, so it's about as current as it can be, too.  Could it really be that the
kernel selects the wrong driver?
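
For double-checking which driver is actually bound to a NIC, a minimal sketch
that follows the sysfs "driver" symlink of the interface.  It assumes sysfs is
mounted at /sys; the interface name "eth0" and the function name are
placeholders.

```python
import os
from typing import Optional


def bound_driver(iface: str, sysfs_root: str = "/sys") -> Optional[str]:
    """Return the name of the kernel driver bound to a network interface.

    Returns None if the interface does not exist or has no backing
    device (e.g. a purely virtual interface such as "lo").
    """
    link = os.path.join(sysfs_root, "class", "net", iface, "device", "driver")
    if not os.path.exists(link):
        return None
    # The symlink points at .../drivers/<name>; the last component is the name.
    return os.path.basename(os.path.realpath(link))


if __name__ == "__main__":
    print(bound_driver("eth0"))  # "eth0" is a placeholder interface name
```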

> - apply driver parameters designed to work around buggy hardware (this
>   often involves much reading)

I will also consider that.  I see that the kernel sources contain
documentation for the e1000e driver that I can look at.

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup




Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails

2015-03-05 Thread Mick
On Thursday 05 Mar 2015 21:46:12 Marc Joliet wrote:
> On Thu, 5 Mar 2015 21:19:46 +, Mick wrote:
> > On Thursday 05 Mar 2015 18:33:23 Todd Goodman wrote:

> > > Is this a WiFi NIC?
> > > 
> > > Is it possible the device is powering down?
> > > 
> > > I've had lots of problems with WiFi devices powering down (both driver
> > > issues as well as just trying to disable the default setting of
> > > powering down.)
> > > 
> > > Todd
> > 
> > If not a WiFi, have you also tried to mirror a port at the router where
> > the DHCP server is running and sniff packets there?  Does the router see
> > the DHCPREQ coming through from the client PCs?
> 
> They apparently don't even reach the managed switch, which is what the PC
> is directly connected to (but again: the third affected PC is on a
> different switch).  I find this very confusing :-/ (and so does our local
> sysadmin, or so I'm told).
> 
> (I have to mention that the best I can do is relay ideas here to my boss
> and the aforementioned sysadmin, as I don't have access to any of the
> network hardware and software, save for the affected PCs.  I am mostly
> trying to collect ideas.)

If the router does not see the DHCP request frames coming from the PCs then
the problem won't be with the router.  Check that the NIC on the affected PCs
is not trying to save power by shutting down, whether this is wired or
wireless.  As Alan said, you'll need to pass some driver parameter to the NIC
driver; I usually do this via the /etc/conf.d/modules file, or by adding a
.conf file in /etc/modprobe.d/

Besides the latest drivers, also check that you are using the latest firmware 
for the NIC if it uses any and check the logs after increasing verbosity on 
the driver to make sure it loads without errors.
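
As a rough sketch of how the current state could be inspected before changing
anything: the snippet below reads the runtime power management mode of the
NIC's device ("auto" means the kernel may suspend it, "on" means it is kept
powered) and lists whatever parameters the driver module exports via sysfs.
It assumes sysfs at /sys; "eth0" and the function names are placeholders, and
not every module parameter is necessarily exported there.

```python
import os
from typing import Dict, Optional


def runtime_pm_mode(iface: str, sysfs_root: str = "/sys") -> Optional[str]:
    """Return the runtime PM setting ("auto" or "on") of the NIC's device."""
    path = os.path.join(sysfs_root, "class", "net", iface,
                        "device", "power", "control")
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None  # no such interface, or no backing device


def module_parameters(module: str, sysfs_root: str = "/sys") -> Dict[str, str]:
    """Return the parameters a loaded module exports via sysfs, if any."""
    params_dir = os.path.join(sysfs_root, "module", module, "parameters")
    result: Dict[str, str] = {}
    try:
        names = sorted(os.listdir(params_dir))
    except OSError:
        return result  # module not loaded, or it exports no parameters
    for name in names:
        try:
            with open(os.path.join(params_dir, name)) as fh:
                result[name] = fh.read().strip()
        except OSError:
            result[name] = "<unreadable>"
    return result


if __name__ == "__main__":
    print("runtime PM:", runtime_pm_mode("eth0"))  # placeholder interface
    for name, value in module_parameters("e1000e").items():
        print(f"{name} = {value}")
```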

-- 
Regards,
Mick




Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails

2015-03-05 Thread Alan McKinnon
On Thu, 5 Mar 2015 13:33:23 -0500
Todd Goodman  wrote:

> * Marc Joliet  [150305 04:47]:
> [..SNIP..]
> > 1.) The NIC is brought up (some built-in Intel model).
> > 
> > 2.) A DHCP client configures it.
> > 
> > 3.) The network connection is lost at some point (the amount of
> > time this takes varies, but it can be as little as 20 minutes).
> > 
> > 4.) Eventually the lease runs out and the DHCP client tries to
> > renew it, but gets no response.  Sometimes, after many hours (at
> > least 6), it will get a DHCPACK, but that's it.  One of our
> > sysadmins says that not only does the DHCP server never see the
> > packets, but the managed switch that the PC is directly attached to
> > *also* never does (again, except for when the occasional DHCPACK
> > comes).
> > 
> > 5.) Restart the network device.  A reboot is not required, but it
> > is necessary to terminate the DHCP client.  After that everything
> > works again.
> > 
> > 6.) GOTO 3.
> [..SNIP..]
> 
> Is this a WiFi NIC?
> 
> Is it possible the device is powering down?
> 
> I've had lots of problems with WiFi devices powering down (both driver
> issues as well as just trying to disable the default setting of
> powering down.)


+1

I've seen similar things many times myself (but never on Intel network
kit so far)

A lot of reading and Googling usually leads to the solution:

- firmware upgrade for the hardware
- use the correct driver (this is often non-obvious)
- try the in-kernel driver vs any out-of-tree vendor driver
- apply driver parameters designed to work around buggy hardware (this
  often involves much reading)

Alan





Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails

2015-03-05 Thread Marc Joliet
On Thu, 5 Mar 2015 21:19:46 +, Mick wrote:

> On Thursday 05 Mar 2015 18:33:23 Todd Goodman wrote:
> > * Marc Joliet  [150305 04:47]:
> > [..SNIP..]
> > 
> > > > 1.) The NIC is brought up (some built-in Intel model).
> > > > 
> > > > 2.) A DHCP client configures it.
> > > > 
> > > > 3.) The network connection is lost at some point (the amount of time this
> > > > takes varies, but it can be as little as 20 minutes).
> > > > 
> > > > 4.) Eventually the lease runs out and the DHCP client tries to renew it,
> > > > but gets no response.  Sometimes, after many hours (at least 6), it will
> > > > get a DHCPACK, but that's it.  One of our sysadmins says that not
> > > > only does the DHCP server never see the packets, but the managed
> > > > switch that the PC is directly attached to *also* never does (again,
> > > > except for when the occasional DHCPACK comes).
> > > > 
> > > > 5.) Restart the network device.  A reboot is not required, but it is
> > > > necessary to terminate the DHCP client.  After that everything works again.
> > > > 
> > > > 6.) GOTO 3.
> > 
> > [..SNIP..]
> > 
> > Is this a WiFi NIC?
> > 
> > Is it possible the device is powering down?
> > 
> > I've had lots of problems with WiFi devices powering down (both driver
> > issues as well as just trying to disable the default setting of powering
> > down.)
> > 
> > Todd
> 
> If not a WiFi, have you also tried to mirror a port at the router where the 
> DHCP server is running and sniff packets there?  Does the router see the 
> DHCPREQ coming through from the client PCs?

They apparently don't even reach the managed switch, which is what the PC is
directly connected to (but again: the third affected PC is on a different
switch).  I find this very confusing :-/ (and so does our local sysadmin, or
so I'm told).

(I have to mention that the best I can do is relay ideas here to my boss and the
aforementioned sysadmin, as I don't have access to any of the network
hardware and software, save for the affected PCs.  I am mostly trying to
collect ideas.)

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup




Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails

2015-03-05 Thread Marc Joliet
On Thu, 5 Mar 2015 13:33:23 -0500, Todd Goodman wrote:

> * Marc Joliet  [150305 04:47]:
> [..SNIP..]
> > 1.) The NIC is brought up (some built-in Intel model).
> > 
> > 2.) A DHCP client configures it.
> > 
> > 3.) The network connection is lost at some point (the amount of time this takes
> > varies, but it can be as little as 20 minutes).
> > 
> > 4.) Eventually the lease runs out and the DHCP client tries to renew it, but
> > gets no response.  Sometimes, after many hours (at least 6), it will get a
> > DHCPACK, but that's it.  One of our sysadmins says that not only does
> > the DHCP server never see the packets, but the managed switch that the PC
> > is directly attached to *also* never does (again, except for when the
> > occasional DHCPACK comes).
> > 
> > 5.) Restart the network device.  A reboot is not required, but it is necessary
> > to terminate the DHCP client.  After that everything works again.
> > 
> > 6.) GOTO 3.
> [..SNIP..]
> 
> Is this a WiFi NIC?

Nope, it's wired.

> Is it possible the device is powering down?

I mentioned the possibility, but don't find it *that* credible, since three
different PCs (with different NICs) have shown the problem.  Plus, sometimes the
one affected PC I work on can still reach the internet (i.e., a browser works),
even though it has already ceased to be reachable.

[...]

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup




Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails

2015-03-05 Thread Mick
On Thursday 05 Mar 2015 18:33:23 Todd Goodman wrote:
> * Marc Joliet  [150305 04:47]:
> [..SNIP..]
> 
> > 1.) The NIC is brought up (some built-in Intel model).
> > 
> > 2.) A DHCP client configures it.
> > 
> > 3.) The network connection is lost at some point (the amount of time this
> > takes varies, but it can be as little as 20 minutes).
> > 
> > 4.) Eventually the lease runs out and the DHCP client tries to renew it,
> > but gets no response.  Sometimes, after many hours (at least 6), it will
> > get a DHCPACK, but that's it.  One of our sysadmins says that not
> > only does the DHCP server never see the packets, but the managed
> > switch that the PC is directly attached to *also* never does (again,
> > except for when the occasional DHCPACK comes).
> > 
> > 5.) Restart the network device.  A reboot is not required, but it is
> > necessary to terminate the DHCP client.  After that everything works again.
> > 
> > 6.) GOTO 3.
> 
> [..SNIP..]
> 
> Is this a WiFi NIC?
> 
> Is it possible the device is powering down?
> 
> I've had lots of problems with WiFi devices powering down (both driver
> issues as well as just trying to disable the default setting of powering
> down.)
> 
> Todd

If not a WiFi, have you also tried to mirror a port at the router where the 
DHCP server is running and sniff packets there?  Does the router see the 
DHCPREQ coming through from the client PCs?
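
When going through a capture from a mirrored port, each DHCP message can be
classified by its option 53 (message type), and the BOOTP flags field shows
whether the client asked for broadcast replies.  A minimal sketch, assuming
the UDP payload of each packet has already been extracted from the capture by
some other tool; the function names are made up for illustration.

```python
import struct
from typing import Optional

MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options
MESSAGE_TYPES = {
    1: "DHCPDISCOVER", 2: "DHCPOFFER", 3: "DHCPREQUEST", 4: "DHCPDECLINE",
    5: "DHCPACK", 6: "DHCPNAK", 7: "DHCPRELEASE", 8: "DHCPINFORM",
}


def dhcp_message_type(payload: bytes) -> Optional[str]:
    """Return the DHCP message type name from a UDP payload, or None."""
    # Fixed BOOTP header is 236 bytes, followed by the 4-byte magic cookie.
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        return None
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == 255:      # end option
            break
        if code == 0:        # pad option, single byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break            # truncated option
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return MESSAGE_TYPES.get(value[0])
        i += 2 + length
    return None


def broadcast_flag(payload: bytes) -> bool:
    """True if the BOOTP broadcast bit (top bit of 'flags') is set."""
    if len(payload) < 12:
        return False
    (flags,) = struct.unpack_from("!H", payload, 10)
    return bool(flags & 0x8000)
```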

-- 
Regards,
Mick




Re: [gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails

2015-03-05 Thread Todd Goodman
* Marc Joliet  [150305 04:47]:
[..SNIP..]
> 1.) The NIC is brought up (some built-in Intel model).
> 
> 2.) A DHCP client configures it.
> 
> 3.) The network connection is lost at some point (the amount of time this takes
> varies, but it can be as little as 20 minutes).
> 
> 4.) Eventually the lease runs out and the DHCP client tries to renew it, but
> gets no response.  Sometimes, after many hours (at least 6), it will get a
> DHCPACK, but that's it.  One of our sysadmins says that not only does
> the DHCP server never see the packets, but the managed switch that the PC
> is directly attached to *also* never does (again, except for when the
> occasional DHCPACK comes).
> 
> 5.) Restart the network device.  A reboot is not required, but it is necessary
> to terminate the DHCP client.  After that everything works again.
> 
> 6.) GOTO 3.
[..SNIP..]

Is this a WiFi NIC?

Is it possible the device is powering down?

I've had lots of problems with WiFi devices powering down (both driver
issues as well as just trying to disable the default setting of powering
down.)

Todd



[gentoo-user] Strange network behaviour: NIC goes down, DHCP lease renewal fails

2015-03-05 Thread Marc Joliet
Hi all,

at work I'm (well, *we* are) facing an interesting problem.  Since we are sort
of stabbing in the dark here, I thought I'd ask here.  Also, since this is from
work, I will not be able to divulge very many details (not to mention that as a
student worker I simply don't *know* many details).  However, I do have
permission from my boss to ask about this in an anonymised fashion.

The symptom we're seeing is that the NIC goes down and DHCP packets stop getting
through after a certain amount of time.  What happens is:

1.) The NIC is brought up (some built-in Intel model).

2.) A DHCP client configures it.

3.) The network connection is lost at some point (the amount of time this takes
varies, but it can be as little as 20 minutes).

4.) Eventually the lease runs out and the DHCP client tries to renew it, but
gets no response.  Sometimes, after many hours (at least 6), it will get a
DHCPACK, but that's it.  One of our sysadmins says that not only does
the DHCP server never see the packets, but the managed switch that the PC
is directly attached to *also* never does (again, except for when the
occasional DHCPACK comes).

5.) Restart the network device.  A reboot is not required, but it is necessary
to terminate the DHCP client.  After that everything works again.

6.) GOTO 3.

(Note that I have observed that steps 3 and 4 do not necessarily occur in
order.)

This has been rather baffling, since this problem is limited to 3 computers.

One of them (the longest running) runs Gentoo, courtesy of me.  This is the
first one we saw the problem with.  Since we couldn't figure it out (switching
from dhcpcd to dhclient, turning off the firewall, monitoring with tcpdump,
etc., all with help from one of our sysadmins; Google, too, of course), Gentoo
was "blamed", so we got a replacement PC with Fedora 20 on it, which *also*
showed this behaviour.  Both PCs run some special software (some of it mine).
Thus, at some point this software was "blamed".

So we started experimenting: we configured the Fedora PC to *not* start the
special software, and have not seen any problems all week.  Yesterday afternoon
I then started *one* of the programs, and had not seen any problems yet by the
time I went home.

So that would speak *for* that theory, right? Well, for comparison, my boss
recently started running a separate PC, also with a bog-standard Fedora 20.
Guess what: it *also* shows the *exact* same behaviour as the other two PCs
("journalctl -u NetworkManager" shows pages upon pages of unanswered
DHCPREQUESTs, with the occasional response thrown in). Note here that this PC
is on a different switch and in a different VLAN.
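
To put numbers on "pages upon pages", a small sketch that tallies DHCP message
names in log lines, so that requests can be compared against answers.  It only
assumes that each relevant log line contains the message name (DHCPREQUEST,
DHCPACK, ...) as a word; the sample lines are made up, and the real journal
format may differ.

```python
import re
from collections import Counter
from typing import Iterable

DHCP_MESSAGE = re.compile(
    r"\b(DHCPDISCOVER|DHCPOFFER|DHCPREQUEST|DHCPACK|DHCPNAK)\b")


def count_dhcp_messages(lines: Iterable[str]) -> Counter:
    """Count how often each DHCP message name appears in the log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = DHCP_MESSAGE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, only to show the shape of the input.
    sample = [
        "dhclient: DHCPREQUEST on eth0 to 192.0.2.1 port 67",
        "dhclient: DHCPREQUEST on eth0 to 192.0.2.1 port 67",
        "dhclient: DHCPACK from 192.0.2.1",
    ]
    for name, n in count_dhcp_messages(sample).most_common():
        print(name, n)
```

The same function accepts any iterable of lines, for example an open file
holding saved journalctl output.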

The choice of Fedora comes from the fact that we use a Fedora based distro
internally, so it is "known".  PCs running it have *not* shown the behaviour
above (AFAIK not even *once*).  Thus, one of the few things I can think of is
finding out what is different about them relative to the standard Fedora.

Right now my main ideas on what the culprit could be are:

- The computers' kernel/network device is improperly configured.  That is,
  maybe special configuration is needed for the computers to work properly as
  clients in the network.  I'm thinking of support for some (from my
  perspective) obscure protocol(s).

- It's a network problem.  The three computers are in two different VLANs,
  while the workplace computers running the internal Fedora based distro are in
  a third (the main network that all the normal Windows and Linux workstations
  are connected to).  However, they are on the same switch as the two computers
  running my software.  One argument against this is that the Windows PC that
  runs on the same VLAN does *not* have any problems like this.

One of the other ideas I had was faulty power management, and I did read of
problems of the sort regarding the exact same network card that is in the old
Gentoo machine on an HP support forum (from around 2008).  However, the local
sysadmin said that they have had nothing but good experience with those network
cards. Also: *three* computers with NIC power management problems?  That sounds
a bit far-fetched to me.  Nevertheless, I am not fully discounting the
possibility.

You can imagine how confusing and frustrating this is.

So, has anybody here ever experienced something like this? Any ideas on what
could be the cause?

Greetings
-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup

