On Wed, 20 Nov 2002, Ray Olszewski wrote:

> If I had to guess where to start from this description, it would be to look 
> for a LAN client that is generating a lot of traffic for some reason. To 
> give a concrete example, we once had similar symptoms here, and we traced 
> them (after we too wasted a lot of time with line tests, NIC tests, and 
> reviews of iptables rulesets) to a mail-forwarding loop between a DMZ 
> server here and an off-site server that chewed up our DSL bandwidth ... not 
> quite all the time, but whenever *both* the local and the remote host were 
> connected to the Internet (most but not all of the time, giving just enough 
> unpredictability to make it *look* like it wasn't a configuration error).
> 
> That's only a guess, though. To be more certain, I'd want to know a bit 
> more about the tests, such as ...
> 
> 1. Does physically disconnecting from the LAN the host that you forward 
> port 25 to affect system performance? What about port 80 (if it is a 
> different host)? Are you running any peer-to-peer apps that might be 
> consuming a lot of bandwidth? (And BTW, how many hosts are on the LAN?)

   Unfortunately, I wasn't able to do a very good job of typing up that
note last night (it's hard to compose well when there is a 2-10 second
pause between typing and seeing the characters on screen).  I will try
to be a little clearer here.

   The internal LAN consists of about 8 machines at the moment.  A few
NT workstations, two Linux desktops, and a group of Linux servers.  All
machines are x86 architecture.

> 2. You wrote that:
> 
> >1) Any standalone machine, plugged directly into the ZyXEL cable modem and
> >configured with the correct static IP address and netmask, gets full
> >bandwidth and brisk ping times.
> 
> "Any" is one of those terms that sounds like it says more than it 
> does.  Put this way, it doesn't actually describe any test; instead, it 
> offers your interpretation of an undescribed test. What was the actual test 
> you did to reach this conclusion? Did you really test EVERY host you have 
> (and how many is that?), each in its normal configuration? Or did you just 
> test 3 or 4 different hosts?

   Three different techs from Time Warner/RR hooked their issue laptops
directly into the ZyXEL modem and got good network performance.  I hooked
my own personal laptop (running Debian 3.0) into the modem and manually
set up the ethernet interface to the proper IP/Netmask/Broadcast, then set
the cable modem as the default gateway.  Saw perfectly good performance.
Took the 486/66 box, removed one of the network cards, booted it up with
the Bering and the Dachstein disks and then logged in at the console.
Good pings, good bandwidth.  Took a desktop Linux box, again running Deb
3.0, plugged *it* directly into the ZyXEL modem and manually configured it
with the static IP information.  It performed fine.  No problems.  Took
one of the Windows NT boxes, plugged *it* directly into the modem and
configured it with the static IP information, and *it* performed fine.
Took one of the desktop boxes, added an extra ethernet card to it, booted
it with the Bering floppy, and so long as the internal network card was
not set up, it worked fine.  At this point, I think use of the word "any"
is perfectly appropriate.  We could plug the other NT workstation and the
other half dozen Linux boxes into it, but it seems like that would be a
total waste of time and effort.  The point is proven to my satisfaction.
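
   For anyone repeating the direct-to-modem test: the broadcast address
follows from the IP and netmask, so it can be derived rather than typed
by hand, and it is worth confirming that the gateway sits on the same
subnet.  A minimal Python sketch of that check follows.  The addresses
are placeholders from the 192.0.2.0/24 documentation range, not the real
RoadRunner assignment, and the function names are hypothetical.

```python
import ipaddress


def static_settings(address: str, netmask: str) -> dict:
    """Derive network and broadcast addresses for a static IPv4 setup."""
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    net = iface.network
    return {
        "address": str(iface.ip),
        "netmask": str(net.netmask),
        "network": str(net.network_address),
        "broadcast": str(net.broadcast_address),
        "prefixlen": net.prefixlen,
    }


def same_subnet(address: str, netmask: str, gateway: str) -> bool:
    """True when the gateway is directly reachable from the interface."""
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    return ipaddress.ip_address(gateway) in iface.network


if __name__ == "__main__":
    # Placeholder values (assumption): substitute the ISP-assigned ones.
    settings = static_settings("192.0.2.10", "255.255.255.248")
    for key, value in settings.items():
        print(f"{key}: {value}")
    print("gateway on-link:",
          same_subnet("192.0.2.10", "255.255.255.248", "192.0.2.9"))
```

   A mistyped netmask or broadcast usually shows up here as a gateway
that is not on-link, which is cheap to rule out before swapping hardware.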

> 3. You wrote:
> 
> >5) As soon as the 2nd NIC was working properly so that the box was
> >actually acting as a router/firewall, the network bandwidth and pings went
> >to hell.
> 
> ... and ...
> 
> >It appears only to happen when there are actually
> >two functioning NICs in the box and it is actually working as a
> >router/firewall.
> 
>  From these descriptions, I can't really tell if your test involved 
> connecting the firewall to the LAN or not. My guess assumes that you are 
> describing something that happens only when the LAN is actually connected 
> to the firewall. If I'm wrong ... if you just mean that the 2 NICs are 
> working, but the internal one is not physically connected to anything, then 
> my guess is bad.

   Your guess is not bad, and that was clearly an oversight in the
testing.  I should have physically disconnected the router from the
internal network and seen if the problem persisted, and I have not done
that yet.  The fact that it works when eth1 is not working and doesn't
work when eth1 is working doesn't necessarily say anything about what
would happen if eth1 was working but *not* connected to the intranet.  I
will try that when I get home this evening.

> 4. What traffic levels is the router reporting that it handles? (Calculate 
> this by running "ip -s link" twice, a minute apart, and seeing how much the 
> total traffic changes by. There may be better ways, but that is one that 
> works reliably.) If my guess is right, the load will show as high on both 
> interfaces. If it is high on the external interface only, then the problem 
> is somewhere in the router's configuration ... might it be running some 
> service you forgot to mention? Do the logs show any unusual patterns of 
> DENYs or REJECTs?

   I will run this test as well when I get home this evening.  The only
unusual numbers of DENY/REJECTs have been due to failing to open up the
appropriate internal ports when adding packages to the Bering router
setup, e.g. no 67/68 when dhcpd was configured, no 123 when I installed
ntpsimpl.lrp.  Opening up the appropriate ports between loc and fw has
solved those problems just fine.  The only significant external DENYs
and REJECTs have come from the occasional port scan or what appear to be
a few slapper-infected boxes.  Occasionally something will try to come
through port 443, which we do not have open.
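
   The two-samples-a-minute-apart measurement can also be scripted.  The
sketch below is an assumption-laden stand-in for running "ip -s link"
twice by hand: it reads the byte counters from /proc/net/dev instead
(a different source for the same kind of per-interface totals), and the
function names are hypothetical.  It does not handle counter
wraparound, and it needs a host with Python, so it would run on one of
the desktop boxes rather than on the floppy router itself.

```python
import time


def parse_counters(text: str) -> dict:
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates(before: dict, after: dict, seconds: float) -> dict:
    """Bytes per second (rx, tx) for interfaces present in both samples."""
    result = {}
    for name, (rx0, tx0) in before.items():
        if name in after:
            rx1, tx1 = after[name]
            result[name] = ((rx1 - rx0) / seconds, (tx1 - tx0) / seconds)
    return result


def sample(path: str = "/proc/net/dev") -> dict:
    """Read and parse the current counters (Linux only)."""
    with open(path) as fh:
        return parse_counters(fh.read())


if __name__ == "__main__":
    interval = 60  # seconds between samples, as in the suggestion above
    first = sample()
    time.sleep(interval)
    second = sample()
    for name, (rx, tx) in sorted(rates(first, second, interval).items()):
        print(f"{name}: rx {rx:.0f} B/s, tx {tx:.0f} B/s")
```

   High rates on both interfaces would point at a LAN host; a high rate
on the external interface alone would point back at the router.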

> 5. Finally, you wrote:
> 
> >    I want to emphasize that this problem started spontaneously
> 
>  From this, I believe that you did nothing to the *router* that caused it. 
> But what about the rest of the LAN? Did you make any changes on the mail 
> server or the Web server?

   Absolutely no changes of *any* kind were made to *any* machine, not
the router, not any of the workstations, not any of the servers.  When I
have working systems, I don't screw with 'em beyond installing security
updates.  The only configuration changes made to the router itself were
setting up the static IP information and turning off dhclient when the
switch was made from residential to commercial RoadRunner service back in
mid to late September.  That configuration ran nonstop with no problems of
any kind all the way through October and early November.  The web server,
mail server, and SSH box haven't required any updates since the 1.3.26 to
1.3.27 Apache update, and the problem with the firewall appeared well
after that upgrade was made (over a week, maybe as long as two).

   We quite literally saw the network fall off a cliff one morning (Nov
5th) having touched *nothing* on the firewall or the network.  The
original RR and Time Warner techs came out and found some problems at the
tap with a lot of dropped packets and inconsistent ping times which
initially led us to believe that the problem was a hardware problem.  They
fixed those problems, but our situation continued.  They then did a bunch
of work on their routers here in the neighborhood, and last Thursday (the
14th) the problem appeared to be fixed.  The cron job I had set up to ping
the firewall from the outside suddenly dropped from 1800 ms pings to 120
ms pings at around 11:00 on that Thursday morning, and continued to look
good through the afternoon.  When I got home I met up with their tech guy,
a very pleasant fellow named Joe, and we went over the network setup and
firewall configuration and ran a bunch of pings and traceroutes to verify
that all was well.  I set up a cron job on an internal machine to run
every 15 minutes and ping 5 or 6 geographically dispersed boxes as well as
running traceroutes to 3 of them.  Come Sunday morning, with only one bad
ping time and maybe 10-15 total dropped packets (out of over 10,000) in
the log files for that cron job, I backed it off from every 15 minutes to
every 30 minutes.  Everything was good until Tuesday morning.  Between the
cron job running at 9:00 EST and 9:30 EST, the ping times soared from the
35-125 ms range (depending upon the number of hops to the target host) to
the 1700-2000 ms range, and the lost packet percentage jumped to 20-40%.
That behavior has continued until now.
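
   The log files from that cron job are easier to scan if each ping
summary is reduced to two numbers.  A rough Python sketch, not the
script actually in use here: it pulls the loss percentage and average
round-trip time out of the summary lines that ping prints, accepting
both the "rtt" and "round-trip" wordings.  The function names and the
thresholds in is_degraded() are hypothetical and would need tuning.

```python
import re

LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(
    r"(?:rtt|round-trip) min/avg/max(?:/[a-z-]+)? = "
    r"([\d.]+)/([\d.]+)/([\d.]+)"
)


def parse_ping(output: str):
    """Return (loss_percent, avg_rtt_ms).

    Either value is None when the corresponding summary line is missing,
    e.g. there is no round-trip line when every packet was lost.
    """
    loss = LOSS_RE.search(output)
    rtt = RTT_RE.search(output)
    loss_pct = float(loss.group(1)) if loss else None
    avg_ms = float(rtt.group(2)) if rtt else None
    return loss_pct, avg_ms


def is_degraded(loss_pct, avg_ms, max_loss=5.0, max_avg_ms=500.0):
    """Flag a run as bad; the default thresholds are illustrative guesses."""
    if loss_pct is None or loss_pct > max_loss:
        return True  # unparseable output is treated as a failure
    return avg_ms is None or avg_ms > max_avg_ms


if __name__ == "__main__":
    # Made-up sample in the shape of a ping summary, not a real log entry.
    text = (
        "10 packets transmitted, 7 packets received, 30% packet loss\n"
        "round-trip min/avg/max = 1700.2/1850.5/2000.9 ms\n"
    )
    loss_pct, avg_ms = parse_ping(text)
    print(loss_pct, avg_ms, is_degraded(loss_pct, avg_ms))
```

   Run over each log entry, this would show the exact 15- or 30-minute
slot where the connection went from good to degraded.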

   Note: pings from the router box to the ZyXEL modem itself, i.e. "first
hop" pings, are in the 2-4 ms range.  That certainly doesn't sound like a
hardware problem with the ethernet cards to me.  It seems to happen only
when you go out beyond the modem into the external network at large.  The
whole situation is sort of insane; if it's a hardware problem, why does
it not matter whether I'm using a 486 w/the SMC Ultra cards or an AMD K6
w/a 3C905 and an FA311?  But if it's a software problem, why will it run
fine for days at a time and then stop, with no error messages in the log
files?  Why can I ping the ZyXEL just fine, but anything that goes beyond
the modem drops off the face of the planet?  If the problem is in their
routers, why does any *single* machine hooked to the modem work?  If the
problem is in the routing/masquerading software, why does it happen both
with Dachstein and with Bering, which use different kernels, different
routing mechanisms, and different firewall solutions?  For every possible
source of the problem that I can think of, there seems to be available
evidence or test results to discredit that possibility.

   My thanks to you for your response.  I will get out another note after
running the additional tests you've suggested.

best,
Jim Wiggs


> At 08:00 PM 11/20/02 -0800, James K. Wiggs wrote:
> 
> >  Folks,
> >
> >    I apologize if this is a FAQ, but my net connection is so slow now that
> >I can't effectively search the web for information.  I have a Road Runner
> >Commercial Cable account in the Tampa Bay area; I upgraded to the
> >commercial in late September after having the residential service for
> >about 2 years.  For that entire time, I'd been using the same box as
> >my firewall/router, a 486/66 w/32M and a pair of SMC Ultra NICs.  The
> >software was originally Eigerstein and later Dachstein and worked
> >perfectly the entire time.  So, about 2 weeks ago the network performance
> >totally went to hell in a handbasket.  Ping times, even to RR internal
> >network machines, are now in the 1600-2000 ms time range.  Packet loss is
> >very high, bandwidth is almost nonexistent.  In an effort to solve the
> >problem, after RR had been out many times and made multiple hardware
> >repair efforts, I upgraded the software to Bering RC4.
> >
> >    The situation, as it stands, is this:
> >
> >1) Any standalone machine, plugged directly into the ZyXEL cable modem and
> >configured with the correct static IP address and netmask, gets full
> >bandwidth and brisk ping times.
> >2) The original router/firewall gets minuscule bandwidth and slow pings,
> >whether booted from Dachstein or Bering.
> >3) A different machine, configured with a 3Com 905TX and a NetGear FA311
> >and booted from Bering RC4, *also* gets lousy bandwidth and slow pings.
> >4) That machine, booted before the proper driver was installed to get the
> >FA311 card working, got fast pings and good bandwidth (the 3Com is the
> >external interface).
> >5) As soon as the 2nd NIC was working properly so that the box was
> >actually acting as a router/firewall, the network bandwidth and pings went
> >to hell.
> >
> >    All of this suggests that the problem is in iptables or in Shorewall,
> >but I can find no discussion of this problem in web searches or DejaNews.
> >I have done little to this Bering configuration beyond configuring the
> >static stuff in the networking setup.  I did install ntpdate and opened up
> >port 123 as a result.  I've got the box acting as a DHCP server for the
> >internal network and have opened up 67 & 68 internally for that.  Ports
> >80, 25, and 22 are being forwarded to internal machines for web, email,
> >and SSH access.  Oh, yes: the dnscache package has been configured and the
> >appropriate ports opened up internally and externally for that.  The box
> >is doing NAT for the entire internal network, of course.  I can upload the
> >iptables/shorewall setup if necessary, but this really is a fairly vanilla
> >setup.
> >
> >    Can anyone suggest what could be causing this problem?  Is it a known
> >problem with Bering or Shorewall?   The net connection is slowing down so
> >badly now that I have to cut this short.
> >
> >    I want to emphasize that this problem started spontaneously and now
> >persists regardless of whether I boot from the new Bering floppy or the
> >old Dachstein floppy.  It appears only to happen when there are actually
> >two functioning NICs in the box and it is actually working as a
> >router/firewall.  There was a period of about 4 days, from last Thursday
> >afternoon until Tuesday morning, when the old 486 box with the Bering
> >floppy worked properly.
> >
> >    I will try to follow up on this tomorrow from a faster access point.
> 
> 
> 
> --
> -------------------------------------------"Never tell me the odds!"--------
> Ray Olszewski                                 -- Han Solo
> Palo Alto, California, USA                      [EMAIL PROTECTED]
> -------------------------------------------------------------------------------
> 



------------------------------------------------------------------------
leaf-user mailing list: [EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/leaf-user
SR FAQ: http://leaf-project.org/pub/doc/docmanager/docid_1891.html
