Re: e1000 driver and samba

2007-09-14 Thread Kok, Auke

L F wrote:

> Folks,
> I've been playing with multiple gigabit ethernet drivers to get samba
> 3.0.25+ to work reliably. The situation is as follows.
> I have a network, one of the machines on the network is a
> server/firewall. It contains an Intel PRO1000 dual port PCI Express
> card and runs Debian-testing.
> The machine is running shorewall 3.4.5 and at present, one port of the
> PRO1000 is configured as the WAN port, the other is bridged to a tap
> device for virtualbox and is running as the LAN port.
> Samba 3.0.25+ will either lose connection or - more worrisomely -
> corrupt data in files upon sustained traffic.
> One of the tests that consistently fails is mounting a samba share
> onto any WinXP client, then trying to unzip a file from the
> mounted/mapped drive into the drive itself (i.e. unzipping
> Z:\Stuff\qqq.zip to Z:\Stuff\qqq\* ).
> If the zip file is of any significant size, one of two things will
> happen. Either the client will complain about losing connection to the
> share - with a corresponding error in the samba logs - or everything
> will be fine.. except the files will be corrupt.
> The unusual thing is that going through the TAP interface from a
> Virtualbox machine yields no problems even when transferring tens of
> GBs of data.
> Copying a large file (500MB+) also has the same effect.
> Now, the machine worked when it was using an onboard Realtek 8169
> chipset on a 945G board from ASUS, but it worked slowly.

this slowness might have been masking the issue

> I upgraded to
> a P965 chipset, started using the realtek driver for the 8110B on that
> board.. and started getting consistent samba errors. I therefore
> killed the onboard LAN, switched to the Intel board, tried both the
> 7.6.5 driver on the Intel website AND the driver in the 2.6.20+
> kernels - 7.2.x IIRC - and it fails, less than it did with the
> realtek, but it fails. Switching back and forth between 2.6.18,
> 2.6.20.x and 2.6.22.x yielded no improvements. I could use some help,
> because I refuse to believe that there isn't a reliable PCIexpress
> gigeth/samba combo available.


I have not yet seen other reports of this issue, and it would be interesting to 
see if the stack or driver is seeing errors. Please post `ethtool -S eth0` after 
the samba connection resets or fails.


Just as a precaution, try a different ethernet cable. Even the switch in between 
the target and you might have issues.


I know our lab folks do plenty of samba testing but I will see if they can run a 
stress test against a smb target in the way that you describe.


Cheers,

Auke
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: e1000 driver and samba

2007-09-14 Thread L F
On 9/14/07, Francois Romieu <[EMAIL PROTECTED]> wrote:
> For the 8169 or the 8110, try 2.6.23-rc6 +
>
> http://www.fr.zoreil.com/people/francois/misc/20070903-2.6.23-rc5-r8169-test.patch
Thank you, I will give that a whirl also, because there are some
machine builds which will not have Intel boards in them and they need
to work, no questions asked. I will report back.

Rgds,
LF


Re: e1000 driver and samba

2007-09-14 Thread Francois Romieu
L F <[EMAIL PROTECTED]> :
[...]
> Now, the machine worked when it was using an onboard Realtek 8169
> chipset on a 945G board from ASUS, but it worked slowly. I upgraded to
> a P965 chipset, started using the realtek driver for the 8110B on that
> board.. and started getting consistent samba errors. I therefore
> killed the onboard LAN, switched to the Intel board, tried both the
> 7.6.5 driver on the Intel website AND the driver in the 2.6.20+
> kernels - 7.2.x IIRC - and it fails, less than it did with the
> realtek, but it fails. Switching back and forth between 2.6.18,
> 2.6.20.x and 2.6.22.x yielded no improvements. I could use some help,
> because I refuse to believe that there isn't a reliable PCIexpress
> gigeth/samba combo available.
> For further reference, the kernel versions are those mentioned above,
> compiled with gcc-3.4.6 and gcc-4.1.2 (current on debian-testing),
> with no improvement between the two.
> Any and all indications appreciated.

For the 8169 or the 8110, try 2.6.23-rc6 +

http://www.fr.zoreil.com/people/francois/misc/20070903-2.6.23-rc5-r8169-test.patch

-- 
Ueimor


Re: e1000 driver and samba

2007-09-14 Thread L F
On 9/14/07, Kok, Auke <[EMAIL PROTECTED]> wrote:
> this slowness might have been masking the issue
That is possible. However, it worked for upwards of twelve months
without an error.

> I have not yet seen other reports of this issue, and it would be interesting to
> see if the stack or driver is seeing errors. Please post `ethtool -S eth0` after
> the samba connection resets or fails.
If you look for it on the Realtek cards, there had been sporadic
issues up to late 2005. The solution posted universally was 'change
card'.

I include the content of ethtool -S as requested:
beehive:~# ethtool -S eth4
NIC statistics:
 rx_packets: 43538709
 tx_packets: 68726231
 rx_bytes: 34124849453
 tx_bytes: 74817483835
 rx_broadcast: 20891
 tx_broadcast: 8941
 rx_multicast: 459
 tx_multicast: 0
 rx_errors: 0
 tx_errors: 0
 tx_dropped: 0
 multicast: 459
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_no_buffer_count: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_abort_late_coll: 0
 tx_deferred_ok: 486
 tx_single_coll_ok: 0
 tx_multi_coll_ok: 0
 tx_timeout_count: 0
 tx_restart_queue: 0
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 rx_align_errors: 0
 tx_tcp_seg_good: 0
 tx_tcp_seg_failed: 0
 rx_flow_control_xon: 488
 rx_flow_control_xoff: 488
 tx_flow_control_xon: 0
 tx_flow_control_xoff: 0
 rx_long_byte_count: 34124849453
 rx_csum_offload_good: 43449333
 rx_csum_offload_errors: 0
 rx_header_split: 0
 alloc_rx_buff_failed: 0
 tx_smbus: 0
 rx_smbus: 0
 dropped_smbus: 0

I am no expert, but I do not see anything that obviously points to an
issue there.
Now, something I did not mention before, though it was clearly evident
from context, is that the errors ONLY occur on samba WRITE. I can read
hundreds of GBs of data without error.
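
Editorial sketch: with a dump this long, it can help to filter out the counters that are expected to be non-zero. The Python below is a minimal illustration, not part of the original exchange. It parses an excerpt of the `ethtool -S` output posted above and lists the non-zero counters that are not plain traffic totals. The name patterns treated as "traffic totals" are an assumption made for this example, not something defined by ethtool or the e1000 driver.

```python
import re

# Excerpt of the counters posted above (not the full dump).
SAMPLE = """\
NIC statistics:
 rx_packets: 43538709
 tx_packets: 68726231
 rx_errors: 0
 tx_errors: 0
 collisions: 0
 rx_crc_errors: 0
 tx_deferred_ok: 486
 rx_flow_control_xon: 488
 rx_flow_control_xoff: 488
 tx_flow_control_xon: 0
 rx_csum_offload_good: 43449333
 rx_csum_offload_errors: 0
"""

# Assumption for this sketch: counters whose names match these patterns are
# ordinary traffic totals, so a non-zero value is expected.
TRAFFIC_TOTALS = re.compile(
    r"(_packets|_bytes|_broadcast|_multicast|_good|_byte_count)$|^multicast$"
)


def parse_stats(text):
    """Turn `ethtool -S` style output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def counters_to_review(stats):
    """Return the non-zero counters that are not plain traffic totals."""
    return {
        name: value
        for name, value in stats.items()
        if value != 0 and not TRAFFIC_TOTALS.search(name)
    }


if __name__ == "__main__":
    for name, value in counters_to_review(parse_stats(SAMPLE)).items():
        print(f"{name}: {value}")
```

On this excerpt the filter leaves tx_deferred_ok and the two rx_flow_control counters, which are the values a reader would want to look up in the hardware documentation.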

> Just as a precaution, try a different ethernet cable. Even the switch in between
> the target and you might have issues.
I will try that and report back. I would not suspect the switch
because transferring between other machines - WinXP machines -
operates correctly, as far as I can tell.

> I know our lab folks do plenty of samba testing but I will see if they can run a
> stress test against a smb target in the way that you describe.
Thank you, I would appreciate that. My concern is more generalised
than this single machine: I will have to check a significant number of
other production machines to see if such errors are common.

Rgds,
Luigi Fabio


Re: e1000 driver and samba

2007-09-14 Thread Kok, Auke

L F wrote:

On 9/14/07, Kok, Auke <[EMAIL PROTECTED]> wrote:

this slowness might have been masking the issue

That is possible. However, it worked for upwards of twelve months
without an error.


I have not yet seen other reports of this issue, and it would be interesting to
see if the stack or driver is seeing errors. Please post `ethtool -S eth0` after
the samba connection resets or fails.

If you look for it on the Realtek cards, there had been sporadic
issues up to late 2005. The solution posted universally was 'change
card'.

I include the content of ethtool -S as requested:
beehive:~# ethtool -S eth4
NIC statistics:
 rx_packets: 43538709
 tx_packets: 68726231
 rx_bytes: 34124849453
 tx_bytes: 74817483835
 rx_broadcast: 20891
 tx_broadcast: 8941
 rx_multicast: 459
 tx_multicast: 0
 rx_errors: 0
 tx_errors: 0
 tx_dropped: 0
 multicast: 459
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_no_buffer_count: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_abort_late_coll: 0
 tx_deferred_ok: 486
 tx_single_coll_ok: 0
 tx_multi_coll_ok: 0
 tx_timeout_count: 0
 tx_restart_queue: 0
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 rx_align_errors: 0
 tx_tcp_seg_good: 0
 tx_tcp_seg_failed: 0
 rx_flow_control_xon: 488
 rx_flow_control_xoff: 488
 tx_flow_control_xon: 0
 tx_flow_control_xoff: 0
 rx_long_byte_count: 34124849453
 rx_csum_offload_good: 43449333
 rx_csum_offload_errors: 0
 rx_header_split: 0
 alloc_rx_buff_failed: 0
 tx_smbus: 0
 rx_smbus: 0
 dropped_smbus: 0

I am no expert, but I do not see anything that obviously points to an
issue there.
Now, something I did not mention before, though it was clearly evident
from context, is that the errors ONLY occur on samba WRITE. I can read
hundreds of GBs of data without error.


can you describe your setup a bit more in detail? you're writing from a linux 
client to a windows smb server? or even to a linux server? which end sees the 
connection drop? the samba server? the samba linux client?


Auke


Re: e1000 driver and samba

2007-09-14 Thread L F
> can you describe your setup a bit more in detail? you're writing from a linux
> client to a windows smb server? or even to a linux server? which end sees the
> connection drop? the samba server? the samba linux client?
Certainly.
I have a LAN, with two switches in a stack. There currently are 7
WinXP clients and one linux machine. The linux machine acts as a samba
server and as a firewall/gateway.
The two ports of the PRO/1000 in the linux box are connected to the
LAN (eth4) and to a Comcast modem (eth3) respectively. Shorewall 3.4.5
is running on the linux machine, with a strong firewall + NAT setup.
Further, the linux machine currently has a tap device bridged into the
LAN side, for virtualbox.
Therefore, eth3 is a plain ethernet interface. br0, on the lan side,
is tap0 + eth4.
If I get any client on the LAN side, I can read from the linux box
without a problem. However, if I attempt to write to the linux box
from a LANside client, it will fail. If traffic is low, the failures
are sporadic. If traffic is high (large file and/or multiple incoming
files) the failure is guaranteed, either in 'delayed write fail' mode
on the client or in silent corruption of the file (much worse). If
read/write activity is combined, for instance when I unzip a zip
archive to its own directory, failure is guaranteed and rapid, with a
'delayed write fail' on the client after 50MB or so.
I can post .config and anything else you may want if you require it. I
tried changing cable as you suggested with little success. I'll try
changing switch port, just to cover all bases.
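
Editorial sketch: the silent corruption described above is easiest to confirm by comparing a digest of the original file with a digest of the copy that landed on the server. The Python below is a minimal illustration using the standard `hashlib` module. The function names are hypothetical, and nothing here was posted in the original thread.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path, copy_path):
    """True when both files have the same SHA-256 digest."""
    return sha256_of(source_path) == sha256_of(copy_path)
```

A reasonable way to use it is to hash the source on the client and the written file directly on the server's local filesystem, rather than reading the copy back over the share, so that the read path and any client-side caching do not enter the comparison.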
>
> Auke
>


Re: e1000 driver and samba

2007-09-14 Thread Bill Fink
On Fri, 14 Sep 2007, L F wrote:

> > can you describe your setup a bit more in detail? you're writing from a linux
> > client to a windows smb server? or even to a linux server? which end sees the
> > connection drop? the samba server? the samba linux client?
> Certainly.
> I have a LAN, with two switches in a stack. There currently are 7
> WinXP clients and one linux machine. The linux machine acts as a samba
> server and as a firewall/gateway.
> The two ports of the PRO/1000 in the linux box are connected to the
> LAN (eth4) and to a Comcast modem (eth3) respectively. Shorewall 3.4.5
> is running on the linux machine, with a strong firewall + NAT setup.
> Further, the linux machine currently has a tap device bridged into the
> LAN side, for virtualbox.
> Therefore, eth3 is a plain ethernet interface. br0, on the lan side,
> is tap0 + eth4.
> If I get any client on the LAN side, I can read from the linux box
> without a problem. However, if I attempt to write to the linux box
> from a LANside client, it will fail. If traffic is low, the failures
> are sporadic. If traffic is high (large file and/or multiple incoming
> files) the failure is guaranteed, either in 'delayed write fail' mode
> on the client or in silent corruption of the file (much worse). If
> read/write activity is combined, for instance when I unzip a zip
> archive to its own directory, failure is guaranteed and rapid, with a
> 'delayed write fail' on the client after 50MB or so.
> I can post .config and anything else you may want if you require it. I
> tried changing cable as you suggested with little success. I'll try
> changing switch port, just to cover all bases.

Would it be worth a shot to try disabling the receiver hardware
checksumming (ethtool -K ethX rx off)?
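
Editorial sketch: with receive checksum offload turned off, the kernel verifies the IP/TCP checksums in software instead of trusting the NIC's verdict. For readers unfamiliar with that check, the Python below is a minimal illustration of the 16-bit one's-complement Internet checksum described in RFC 1071. It is not code from the e1000 driver or the kernel.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over `data`, as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(payload)))
```

A receiver recomputes the sum over the data plus the transmitted checksum and expects zero. Note that a 16-bit sum cannot catch every corruption pattern, and it says nothing about data damaged after the check has passed (for example in memory or on the way to disk).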

-Bill


Re: e1000 driver and samba

2007-09-15 Thread L F
On 9/15/07, Bill Fink <[EMAIL PROTECTED]> wrote:
> Would it be worth a shot to try disabling the receiver hardware
> checksumming (ethtool -K ethX rx off)?
I just did, unfortunately it doesn't seem to change much.
In the various attempts, however, I seem to have improved something,
maybe. As I mentioned, the machine works as both a gateway and a samba
server. It seems that as long as the one samba operation is the only
activity that goes on between a specific client and the samba server,
at least I do not get timeouts. I have to investigate further on file
corruption, but the timeout doesn't occur.
As soon as I try to perform a second samba operation - even on read -
or I generate significant net traffic, the samba connection times out.
I have to assume that the added load is just enough to overload the
link.
The other curious fact is that occasionally the writing effectively
pauses for 1-2s. It then resumes, with no ill effect. I have to assume
that when the timeouts occur, the same thing happens for a
sufficiently long period of time and the connection drops.
For further reference, in case any of it is relevant, the samba shares
are ext3 filesystems residing on a SATA based RAID5 (md) array. I
include this because at a certain point I started suspecting the
filesystem before the network, but it doesn't make too much sense.

LF


Re: e1000 driver and samba

2007-09-15 Thread L F
I also upgraded to samba-3.0.26-1 and so far the problem seems
significantly less frequent and limited, unfortunately, to the
'silent' corruption. I am still running with HW checksumming off, with
a new cable (cat6, even though it's 50cm long, so I could probably be
running chicken wire), on the original switch port.
I wonder why this isn't more widespread?

LF / now very puzzled


Re: e1000 driver and samba

2007-09-15 Thread James Chapman

Kok, Auke wrote:

L F wrote:

On 9/14/07, Kok, Auke <[EMAIL PROTECTED]> wrote:

this slowness might have been masking the issue

That is possible. However, it worked for upwards of twelve months
without an error.

I have not yet seen other reports of this issue, and it would be interesting to
see if the stack or driver is seeing errors. Please post `ethtool -S eth0` after
the samba connection resets or fails.

If you look for it on the Realtek cards, there had been sporadic
issues up to late 2005. The solution posted universally was 'change
card'.

I include the content of ethtool -S as requested:
beehive:~# ethtool -S eth4
NIC statistics:
 rx_packets: 43538709
 tx_packets: 68726231
 rx_bytes: 34124849453
 tx_bytes: 74817483835
 rx_broadcast: 20891
 tx_broadcast: 8941
 rx_multicast: 459
 tx_multicast: 0
 rx_errors: 0
 tx_errors: 0
 tx_dropped: 0
 multicast: 459
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_no_buffer_count: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_abort_late_coll: 0
 tx_deferred_ok: 486
 tx_single_coll_ok: 0
 tx_multi_coll_ok: 0
 tx_timeout_count: 0
 tx_restart_queue: 0
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 rx_align_errors: 0
 tx_tcp_seg_good: 0
 tx_tcp_seg_failed: 0
 rx_flow_control_xon: 488
 rx_flow_control_xoff: 488
 tx_flow_control_xon: 0
 tx_flow_control_xoff: 0
 rx_long_byte_count: 34124849453


Are these long frames expected in your network? What is the MTU of the 
transmitting clients? Perhaps this might explain why reads work (because 
data is coming from the Linux box so the packets have smaller MTU) while 
writes cause delays or packet loss because the clients are sending long 
frames which are getting fragmented?



 rx_csum_offload_good: 43449333
 rx_csum_offload_errors: 0
 rx_header_split: 0
 alloc_rx_buff_failed: 0
 tx_smbus: 0
 rx_smbus: 0
 dropped_smbus: 0

I am no expert, but I do not see anything that obviously points to an
issue there.
Now, something I did not mention before, though it was clearly evident
from context, is that the errors ONLY occur on samba WRITE. I can read
hundreds of GBs of data without error.


can you describe your setup a bit more in detail? you're writing from a 
linux client to a windows smb server? or even to a linux server? which 
end sees the connection drop? the samba server? the samba linux client?


Auke




--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development



Re: e1000 driver and samba

2007-09-15 Thread Kok, Auke

James Chapman wrote:

Kok, Auke wrote:

L F wrote:

On 9/14/07, Kok, Auke <[EMAIL PROTECTED]> wrote:

this slowness might have been masking the issue

That is possible. However, it worked for upwards of twelve months
without an error.

I have not yet seen other reports of this issue, and it would be interesting to
see if the stack or driver is seeing errors. Please post `ethtool -S eth0` after
the samba connection resets or fails.

If you look for it on the Realtek cards, there had been sporadic
issues up to late 2005. The solution posted universally was 'change
card'.

I include the content of ethtool -S as requested:
beehive:~# ethtool -S eth4
NIC statistics:
 rx_packets: 43538709
 tx_packets: 68726231
 rx_bytes: 34124849453
 tx_bytes: 74817483835
 rx_broadcast: 20891
 tx_broadcast: 8941
 rx_multicast: 459
 tx_multicast: 0
 rx_errors: 0
 tx_errors: 0
 tx_dropped: 0
 multicast: 459
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_no_buffer_count: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_abort_late_coll: 0
 tx_deferred_ok: 486


this one I wonder about, and might cause delays, I'll have to look up what it 
exactly could implicate though.



 tx_single_coll_ok: 0
 tx_multi_coll_ok: 0
 tx_timeout_count: 0
 tx_restart_queue: 0
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 rx_align_errors: 0
 tx_tcp_seg_good: 0
 tx_tcp_seg_failed: 0
 rx_flow_control_xon: 488
 rx_flow_control_xoff: 488
 tx_flow_control_xon: 0
 tx_flow_control_xoff: 0
 rx_long_byte_count: 34124849453


Are these long frames expected in your network? What is the MTU of the 
transmitting clients? Perhaps this might explain why reads work (because 
data is coming from the Linux box so the packets have smaller MTU) while 
writes cause delays or packet loss because the clients are sending long 
frames which are getting fragmented?


those are not "long frames" but the number of bytes the hardware counted in its 
"long" data type based byte counter.


Auke


Re: e1000 driver and samba

2007-09-15 Thread L F
> >>>  tx_deferred_ok: 486
>
> this one I wonder about, and might cause delays, I'll have to look up what it
> exactly could implicate though.
Please do and let me know. samba 3.0.26 helped, but the issue is still there.

> those are not "long frames" but the number of bytes the hardware counted in its
> "long" data type based byte counter.
Thank you for confirming that. It looks like it comes out to a little
over 32GB received... that sounds right.
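
Editorial sketch: the conversion can be checked quickly. The Python below is a minimal illustration using the rx_long_byte_count and rx_packets values from the dump earlier in the thread. It also computes the average bytes per received packet, which is a rough sanity check only, since an average cannot rule out individual oversized frames.

```python
# Values copied from the `ethtool -S eth4` output earlier in the thread.
RX_BYTES = 34124849453
RX_PACKETS = 43538709

gib = RX_BYTES / 2**30   # binary gigabytes (GiB)
gb = RX_BYTES / 10**9    # decimal gigabytes (GB)
avg_bytes_per_packet = RX_BYTES / RX_PACKETS

if __name__ == "__main__":
    print(f"{gib:.2f} GiB, {gb:.2f} GB")
    print(f"{avg_bytes_per_packet:.0f} bytes per received packet on average")
```

The total works out to a little under 32 GiB (a little over 34 GB in decimal units), and the average packet is well below a 1500-byte MTU, which fits the reading that the counter is a byte total and not a count of long frames.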

> Auke

LF


Re: e1000 driver and samba

2007-09-15 Thread L F
On 9/15/07, James Chapman <[EMAIL PROTECTED]> wrote:
> Are these long frames expected in your network? What is the MTU of the
> transmitting clients? Perhaps this might explain why reads work (because
> data is coming from the Linux box so the packets have smaller MTU) while
> writes cause delays or packet loss because the clients are sending long
> frames which are getting fragmented?
I doublechecked in any case. All machines on the network have the
default XP MTU of 1500. So does the linux machine. I therefore foresee
no problem there.

> James Chapman

LF


Re: e1000 driver and samba

2007-09-15 Thread Kok, Auke
L F wrote:
>>> tx_deferred_ok: 486
>> this one I wonder about, and might cause delays, I'll have to look
>> up what it exactly could implicate though.
> Please do and let me know. samba 3.0.26 helped, but the issue is
> still there.

ok, from the spec: tx_deferred_ok is what is in the DC stats register. DC stands
for "Deferred Count". This initially is meant to track how often the TX unit
cannot send because the medium is busy in a Half-Duplex link state.

To me it suggests that your speed is not full-duplex. Check `ethtool eth0`
output and see if your link is full duplex or not. also check previous kernel
messages and see what the e1000 driver posted there for link speed messages
(as in "e1000: Link is UP speed XXX duplex YYY")

Cheers,

Auke


Re: e1000 driver and samba

2007-09-16 Thread James Chapman

Kok, Auke wrote:

James Chapman wrote:

Kok, Auke wrote:



 rx_long_byte_count: 34124849453


Are these long frames expected in your network? What is the MTU of the 
transmitting clients? Perhaps this might explain why reads work 
(because data is coming from the Linux box so the packets have smaller 
MTU) while writes cause delays or packet loss because the clients are 
sending long frames which are getting fragmented?


those are not "long frames" but the number of bytes the hardware counted 
in its "long" data type based byte counter.


Thanks for correcting me, Auke.

Should this counter be renamed to avoid someone else making this mistake 
in the future? Just a thought.


--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development



Re: e1000 driver and samba

2007-09-16 Thread Kok, Auke
James Chapman wrote:
> Kok, Auke wrote:
>> James Chapman wrote:
>>> Kok, Auke wrote:
> 
>  rx_long_byte_count: 34124849453
>>>
>>> Are these long frames expected in your network? What is the MTU of
>>> the transmitting clients? Perhaps this might explain why reads work
>>> (because data is coming from the Linux box so the packets have
>>> smaller MTU) while writes cause delays or packet loss because the
>>> clients are sending long frames which are getting fragmented?
>>
>> those are not "long frames" but the number of bytes the hardware
>> counted in its "long" data type based byte counter.
> 
> Thanks for correcting me, Auke.
> 
> Should this counter be renamed to avoid someone else making this mistake
> in the future? Just a thought.

well, that would break tools that read this value. And for all of these stats
we can say that you should read our SDMs to figure out what they really
mean anyway, hence my caution in interpreting the other value at first.

Auke


Re: e1000 driver and samba

2007-09-17 Thread L F
> To me it suggests that your speed is not full-duplex. Check `ethtool eth0`
> output and see if your link is full duplex or not. also check previous kernel
> messages and see what the e1000 driver posted there for link speed messages
> (as in "e1000: Link is UP speed XXX duplex YYY")
from dmesg:
device eth4 entered promiscuous mode
e1000: eth4: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: RX/TX
[It looks like the e1000 driver that came in the kernel is Intel(R)
PRO/1000 Network Driver - version 7.3.20-k2 - would there be any
benefit to trying the 7.6.5 from the Intel website again?]

from ethtool:
beehive:~# ethtool eth4
Settings for eth4:
Supported ports: [ TP ]
Supported link modes:   10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full
100baseT/Half 100baseT/Full
1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x0007 (7)
Link detected: yes
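
Editorial sketch: on a fleet of machines, the duplex check requested above can be scripted. The Python below is a minimal illustration that parses a few lines of the `ethtool eth4` output shown above and reports anything other than a detected, full-duplex link. The function names and the wording of the problem messages are hypothetical and were not part of the original thread.

```python
# A few lines taken from the `ethtool eth4` output above.
SAMPLE = """\
Settings for eth4:
Speed: 1000Mb/s
Duplex: Full
Auto-negotiation: on
Link detected: yes
"""


def parse_settings(text):
    """Collect `Key: value` lines; headings and continuation lines are skipped."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            settings[key.strip()] = value.strip()
    return settings


def link_problems(settings):
    """Return human-readable notes for anything that looks wrong with the link."""
    problems = []
    if settings.get("Link detected") != "yes":
        problems.append("no link detected")
    if settings.get("Duplex") != "Full":
        problems.append(f"duplex is {settings.get('Duplex', 'unknown')}, not Full")
    return problems


if __name__ == "__main__":
    issues = link_problems(parse_settings(SAMPLE))
    print(issues if issues else "link reports full duplex")
```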

As best I can tell, the card is in full duplex mode.
Because of a 'running out of ideas' compulsion I disassembled and
reassembled the machine completely, ran a memory test overnight,
changed the cable AGAIN with a CAT6 of the shortest possible length.
That plus samba-3.0.26-1 seem to have cured the disconnects - as a
matter of fact I CAN'T get the machine to disconnect anymore, even
under completely artificial loads (i.e. stress test quality, not
average use) from five clients (I know, that isn't saying much, but it
was failing spectacularly at ONE before, so I figure this may be worth
mentioning).
However, the incorrect file transfer still occurs with large files
(500MB+). My original thought behind the disassembly/reassembly/memory
test was that possibly the issue was hardware related, but I seem to
have eliminated that possibility.
Further, I checked. There are currently 20+ machines in production
with the same debian distribution and kernel, running on 975X / P965
boards, all with r8169 drivers, doing RAID5 fileserver duty. They
work. With significant numbers (up to 65) of clients. This one doesn't
want to. I can't help but think it's the NIC/driver combo, but it
seems absurd to me.

Rgds,
LF

> Cheers,
>
> Auke
>


Re: e1000 driver and samba

2007-09-17 Thread Kok, Auke
L F wrote:
>> To me it suggests that your speed is not full-duplex. Check `ethtool eth0`
>> output and see if your link is full duplex or not. also check previous kernel
>> messages and see what the e1000 driver posted there for link speed messages
>> (as in "e1000: Link is UP speed XXX duplex YYY")
> from dmesg:
> device eth4 entered promiscuous mode
> e1000: eth4: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex,
> Flow Control: RX/TX
> [It looks like the e1000 driver that came in the kernel is Intel(R)
> PRO/1000 Network Driver - version 7.3.20-k2 - would there be any
> benefit to trying the 7.6.5 from the Intel website again?]
> 
> from ethtool:
> beehive:~# ethtool eth4
> Settings for eth4:
> Supported ports: [ TP ]
> Supported link modes:   10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Supports auto-negotiation: Yes
> Advertised link modes:  10baseT/Half 10baseT/Full
> 100baseT/Half 100baseT/Full
> 1000baseT/Full
> Advertised auto-negotiation: Yes
> Speed: 1000Mb/s
> Duplex: Full
> Port: Twisted Pair
> PHYAD: 0
> Transceiver: internal
> Auto-negotiation: on
> Supports Wake-on: d
> Wake-on: d
> Current message level: 0x0007 (7)
> Link detected: yes
> 
> As best I can tell, the card is in full duplex mode.
> Because of a 'running out of ideas' compulsion I disassembled and
> reassembled the machine completely, ran a memory test overnight,
> changed the cable AGAIN with a CAT6 of the shortest possible length.

The statistic we were looking at _will_ increase when running in half duplex,
but if it increases when running in full duplex, that might indicate a hardware
failure. Probably you have fixed the issue with the CAT6 cable.

Can you run this new configuration with the old cable? That would either
implicate or eliminate the cable.

> That plus samba-3.0.26-1 seem to have cured the disconnects - as a
> matter of fact I CAN'T get the machine to disconnect anymore, even
> under completely artificial loads (i.e. stress test quality, not
> average use) from five clients (I know, that isn't saying much, but it
> was failing spectacularly at ONE before, so I figure this may be worth
> mentioning).
> However, the incorrect file transfer still occurs with large files
> (500MB+). My original thought behind the disassembly/reassembly/memory
> test was that possibly the issue was hardware related, but I seem to
> have eliminated that possibility.
> Further, I checked. There are currently 20+ machines in production
> with the same debian distribution and kernel, running on 975X / P965
> boards, all with r8169 drivers, doing RAID5 fileserver duty. They
> work. With significant numbers (up to 65) of clients. This one doesn't
> want to. I can't help but think it's the NIC/driver combo, but it
> seems absurd to me.

A single port failure on a switch can also happen, and samba is definitely
a good test for defective hardware. I cannot rule out anything from the
information we have gotten yet.

Auke


Re: e1000 driver and samba

2007-09-17 Thread Rick Jones

Kok, Auke wrote:

> L F wrote:
>> tx_deferred_ok: 486
>>>
>>> this one I wonder about, and might cause delays, I'll have to look
>>> up what it exactly could implicate though.
>>
>> Please do and let me know. samba 3.0.26 helped, but the issue is
>> still there.
>
> ok, from the spec: tx_deferred_ok is what is in the DC stats register.
> DC stands for "Deferred Count". This initially is meant to track how
> often the TX unit cannot send because the medium is busy in a
> Half-Duplex link state.
>
> To me it suggests that your speed is not full-duplex. Check `ethtool eth0`
> output and see if your link is full duplex or not. also check previous
> kernel messages and see what the e1000 driver posted there for link speed
> messages (as in "e1000: Link is UP speed XXX duplex YYY")

Shouldn't there then have been at least _some_ collisions reported in
the stats?  And perhaps some late collisions?


rick jones


Re: e1000 driver and samba

2007-09-17 Thread Kok, Auke
Rick Jones wrote:
> Kok, Auke wrote:
>> L F wrote:
>>
>>> tx_deferred_ok: 486
>>>>
>>>> this one I wonder about, and might cause delays, I'll have to look
>>>> up what it exactly could implicate though.
>>>
>>> Please do and let me know. samba 3.0.26 helped, but the issue is
>>> still there.
>>
>> ok, from the spec: tx_deferred_ok is what is in the DC stats register.
>> DC stands for "Deferred Count". This initially is meant to track how
>> often the TX unit cannot send because the medium is busy in a
>> Half-Duplex link state.
>>
>> To me it suggests that your speed is not full-duplex. Check `ethtool
>> eth0` output and see if your link is full duplex or not. also check
>> previous kernel messages and see what the e1000 driver posted there
>> for link speed messages (as in "e1000: Link is UP speed XXX duplex YYY")
> 
> Shouldn't there then have been at least _some_ collisions reported in
> the stats?  And perhaps some late collisions?

well, from the documentation it sounds like the link was half-duplex, but
LF reported that it's not. This then points towards a medium issue (bad
cable) and after he replaced the cable, the issue went away (?).

I still don't fully grasp the reason why this counter would increment and
will investigate possibilities for that. My current suspicion is a physical
problem, most likely cable-related.

Auke


Re: e1000 driver and samba

2007-09-17 Thread L F
On 9/17/07, Kok, Auke <[EMAIL PROTECTED]> wrote:
> The statistic we were looking at _will_ increase when running in half duplex,
> but if it increases when running in full duplex might indicate a hardware
> failure. Probably you have fixed the issue with the CAT6 cable.
Uhm, 'fixed' may be premature: I restarted the machine and with 22
hours uptime I am getting:
tx_deferred_ok: 36254

> Can you run this new configuration with the old cable? that would eliminate
> the cable (or not)
I most certainly can. This seems to have gotten worse by a factor of
100 or more.. so am I to suspect the new cable?

> A single port failure on a switch can also happen, and samba is definitely
> a good test for defective hardware. I cannot rule out anything from the
> information we have gotten yet.
True, but I tried changing the switch ports with little change.
Putting a client on the same switch port yielded no errors on the
client, although unfortunately I don't have ethtool statistics on XP.
The switch, btw, is a fairly generic GS108 from Netgear (there
actually are two).

LF


RE: e1000 driver and samba

2007-09-17 Thread Brandeburg, Jesse
L F wrote:
> On 9/17/07, Kok, Auke <[EMAIL PROTECTED]> wrote:
>> The statistic we were looking at _will_ increase when running in
>> half duplex, but if it increases when running in full duplex might
>> indicate a hardware failure. Probably you have fixed the issue with
>> the CAT6 cable. 
> Uhm, 'fixed' may be premature: I restarted the machine and with 22
> hours uptime I am getting:
> tx_deferred_ok: 36254
> 
>> Can you run this new configuration with the old cable? that would
>> eliminate the cable (or not)
> I most certainly can. This seems to have gotten worse by a factor of
> 100 or more.. so am I to suspect the new cable?
> 
>> A single port failure on a switch can also happen, and samba is
>> definitely
>> a good test for defective hardware. I cannot rule out anything from
>> the information we have gotten yet.
> True, but I tried changing the switch ports with little change.
> Putting a client on the same switch port yielded no errors on the
> client, although unfortunately I don't have ethtool statistics on XP.
> The switch, btw, is a fairly generic GS108 from Netgear (there
> actually are two).

It may not be well documented, but the hardware has several states that
it can get into that can cause the tx_deferred counter to increment.  None
of them are fatal to traffic; it is mainly an informational statistic.

In this case it is in the "due to receiving flow control; tx is paused"
state...

He has 488 rx flow control xoff/xon, which means either the switch is being
overloaded and sending flow control, or the switch is passing through
flow control packets (which it should not, since they are multicast) and
(some) client is overloaded.

Can you turn off flow control at the server?  Either run

    ethtool -A ethX rx off tx off

or load the driver with parameter FlowControl=0.  With the 7.6.5
driver at least you'll get confirmation of the flow control change on
the "Link Up:" line.
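
The reasoning above (deferred transmits with zero collisions but a matching number of received pause frames point at flow control rather than duplex) can be sketched as a small check over `ethtool -S` output. This is only an illustrative sketch: the function names and the wording of the hints are made up here, and the sample text is assembled from the figures mentioned in this thread, not from a real capture.

```python
import re

def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a dict of counter name -> int."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"\s*([A-Za-z0-9_]+):\s*(\d+)\s*$", line)
        if m:
            stats[m.group(1)] = int(m.group(2))
    return stats

def flow_control_hint(stats):
    """Hypothetical helper: guess why tx_deferred_ok is non-zero."""
    deferred = stats.get("tx_deferred_ok", 0)
    collisions = stats.get("collisions", 0)
    xoff = stats.get("rx_flow_control_xoff", 0)
    if deferred == 0:
        return "no deferred transmits"
    if collisions > 0:
        return "collisions present: check duplex"
    if xoff > 0:
        return "pause frames received: deferrals likely due to flow control"
    return "deferrals not explained by these counters"

# Illustrative sample built from numbers quoted in the thread.
SAMPLE = """NIC statistics:
     collisions: 0
     tx_deferred_ok: 486
     rx_flow_control_xon: 488
     rx_flow_control_xoff: 488
"""

stats = parse_ethtool_stats(SAMPLE)
print(flow_control_hint(stats))
```

Taking two snapshots some minutes apart and comparing the deltas would be more telling than absolute values, since the counters accumulate from driver load.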

Jesse


Re: e1000 driver and samba

2007-09-17 Thread Bill Fink
On Mon, 17 Sep 2007, Brandeburg, Jesse wrote:

> L F wrote:
> > On 9/17/07, Kok, Auke <[EMAIL PROTECTED]> wrote:
> >> The statistic we were looking at _will_ increase when running in
> >> half duplex, but if it increases when running in full duplex might
> >> indicate a hardware failure. Probably you have fixed the issue with
> >> the CAT6 cable. 
> > Uhm, 'fixed' may be premature: I restarted the machine and with 22
> > hours uptime I am getting:
> > tx_deferred_ok: 36254
> > 
> >> Can you run this new configuration with the old cable? that would
> >> eliminate the cable (or not)
> > I most certainly can. This seems to have gotten worse by a factor of
> > 100 or more.. so am I to suspect the new cable?
> > 
> >> A single port failure on a switch can also happen, and samba is
> >> definitely
> >> a good test for defective hardware. I cannot rule out anything from
> >> the information we have gotten yet.
> > True, but I tried changing the switch ports with little change.
> > Putting a client on the same switch port yielded no errors on the
> > client, although unfortunately I don't have ethtool statistics on XP.
> > The switch, btw, is a fairly generic GS108 from Netgear (there
> > actually are two).
> 
> it may be not well documented, but the hardware has several states that
> it can get into that can cause tx_deferred counter to increment.  None
> of them are fatal to traffic, it is mainly an informational statistic.
> 
> in this case it is in the "due to receiving flow control; tx is paused"
> state...
> 
> he has 488 rx flow control xoff/xon, which means the switch is being
> overloaded and sending flow control, or the switch is passing through
> flow control packets (which it should not since they are multicast) and
> (some) client is overloaded.
> 
> can you turn off flow control at the server?  ethtool -A ethX rx off tx
> off or load the driver with parameter FlowControl=0  With the 7.6.5
> driver at least you'll get confirmation of the flow control change on
> the "Link Up:" line.

It may also be a useful test to disable hardware TSO support
via "ethtool -K ethX tso off".

-Bill


Re: e1000 driver and samba

2007-09-18 Thread Urs Thuermann
Bill Fink <[EMAIL PROTECTED]> writes:

> It may also be a useful test to disable hardware TSO support
> via "ethtool -K ethX tso off".

All suggestions here on the list, i.e. checking for flow control,
duplex, cable problems, etc. don't explain (at least to me) why LF
sees file corruption.  How can a corrupted frame pass the TCP checksum
check?  Does TCP use the hardware checksum of the NIC if available?
AFAICS, this would be the only way for a corrupt frame to make it into
the file.  But Bill already suggested this and LF reported that it
didn't make a difference.

A few months ago I had hardware problems with an embedded device, where
transmission from the NIC via the PCI bus to the CPU had some bits
flipped.  But tcpdump clearly showed the TCP checksum errors and also
TCP recognized the errors and the connection was stalled.  And, BTW,
we also observed an increasing percentage of corrupted frames with
increasing traffic on that interface, i.e. increasing load on the PCI
bus.

So I would run tcpdump -s0 and watch for "incorrect checksum" messages.
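
Watching a long capture by eye is tedious, so here is a rough sketch of counting checksum verdicts in saved tcpdump text output. It assumes the capture was taken in verbose mode (as far as I know tcpdump only prints the "cksum ... (correct)" / "(incorrect ...)" verdict with -v), and the two sample lines are hypothetical, not taken from LF's machine. Note also that with TX checksum offload enabled, packets captured on the sending host can show up as "incorrect" even though they leave the NIC intact, so the receive direction is the one to watch.

```python
import re

# Matches both "(incorrect -> 0x...)" and the older "(incorrect (-> 0x...)" form.
CKSUM_RE = re.compile(r"cksum 0x[0-9a-f]+ \((correct|incorrect)")

def count_checksums(lines):
    """Count 'correct' and 'incorrect' checksum verdicts in tcpdump -v text."""
    counts = {"correct": 0, "incorrect": 0}
    for line in lines:
        m = CKSUM_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Hypothetical sample lines in the style of `tcpdump -v` output.
SAMPLE = [
    "    10.0.0.2.445 > 10.0.0.9.1042: Flags [.], cksum 0x3a51 (correct), "
    "ack 1, win 64240, length 0",
    "    10.0.0.9.1042 > 10.0.0.2.445: Flags [P.], cksum 0x1c2e "
    "(incorrect -> 0x8d4a), seq 1:1461, ack 1, win 65535, length 1460",
]

counts = count_checksums(SAMPLE)
print(counts)
```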


urs


Re: e1000 driver and samba

2007-09-18 Thread Bill Fink
On 18 Sep 2007, Urs Thuermann wrote:

> Bill Fink <[EMAIL PROTECTED]> writes:
> 
> > It may also be a useful test to disable hardware TSO support
> > via "ethtool -K ethX tso off".
> 
> All suggestions here on the list, i.e. checking for flow control,
> duplex, cable problems, etc. don't explain (at least to me) why LF
> sees file corruption.  How can a corrupted frame pass the TCP checksum
> check?  Does TCP use the hardware checksum of the NIC if available?
> AFAICS, this would be the only way for a corrupt frame to make it into
> the file.  But Bill already suggested this and LF reported that it
> didn't make a difference.
> 
> A few months ago I had hardware problems with an embedded device, where
> transmission from the NIC via the PCI bus to the CPU had some bits
> flipped.  But tcpdump clearly showed the TCP checksum errors and also
> TCP recognized the errors and the connection was stalled.  And, BTW,
> we also observed an increasing percentage of corrupted frames with
> increasing traffic on that interface, i.e. increasing load on the PCI
> bus.
> 
> So I would run tcpdump -s0 and watch for "incorrect checksum" messages.

I agree TSO is an unlikely candidate since it should only affect
transmits and the problem as I understand it is with receives.
But still one of the first things I try doing when dealing with
weird problems is disabling all hardware assists.

But I also agree with you that network errors should normally be
detected by the TCP checksum (unless hardware checksumming was
messed up), and from what I recall there were no receive checksum
errors being seen.  That and the fact that the problem was seen
with two different NICs would lead me to believe that the problem
is elsewhere in the system.

That leaves many possibilities.  It could be a memory problem,
although it was indicated that memory testing was successfully
performed (but we don't know how extensive the memory checking
enabled via the BIOS is).  It could be the PCI bus transfers on the
way back to the disk, or a problem with the disk/controller/fs writes
themselves (some kind of disk stress test might be useful).

-Bill


Re: e1000 driver and samba

2007-09-18 Thread Florian Weimer
* Urs Thuermann:

> How can a corrupted frame pass the TCP checksum check?

The TCP/IP checksums are extremely weak.  If the corruption is due to
defective SRAM or something like that, it's likely that it causes an
error pattern which is 16-bit-aligned.  And an even number of
16-bit-aligned bit flips is not detected by the TCP checksum. 8-(

Actually, nobody should use TCP without application-level checksums
for that reason.  But of course, there is HTTP.
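
To make the weakness concrete, here is a minimal sketch of the 16-bit one's-complement checksum (in the style of RFC 1071) applied to a made-up 8-byte payload. The payload and the helper names are assumptions for illustration only. It suggests the blind spot is a little narrower than "any even number of flips": the corruption has to cancel out in the sum, for example two 16-bit words swapped, or the same bit position flipped 0->1 in one word and 1->0 in another.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of 16-bit big-endian words, inverted."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def flip(data: bytes, index: int, mask: int) -> bytes:
    """Return a copy of data with the bits in mask flipped at byte index."""
    out = bytearray(data)
    out[index] ^= mask
    return bytes(out)

original = bytes.fromhex("123456789abcdef0")  # made-up payload

# Two 16-bit words exchanged: the sum is order-independent.
swapped = original[2:4] + original[0:2] + original[4:]

# Same bit position, opposite directions (0->1 in word 0, 1->0 in word 1).
compensating = flip(flip(original, 0, 0x04), 2, 0x04)

# Same bit position, same direction (0->1 in word 0 and in word 2).
same_direction = flip(flip(original, 0, 0x04), 4, 0x04)

single = flip(original, 0, 0x04)

ref = internet_checksum(original)
print("swapped words undetected:     ", internet_checksum(swapped) == ref)
print("compensating flips undetected:", internet_checksum(compensating) == ref)
print("same-direction flips detected:", internet_checksum(same_direction) != ref)
print("single flip detected:         ", internet_checksum(single) != ref)
```

An application-level hash (MD5, CRC32) over the whole file catches all four of these cases, which is the point of the recommendation above.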

-- 
Florian Weimer<[EMAIL PROTECTED]>
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


Re: e1000 driver and samba

2007-09-18 Thread L F
This is the latest ethtool -S :
beehive:~# ethtool -S eth4
NIC statistics:
 rx_packets: 33491526
 tx_packets: 41410384
 rx_bytes: 28384277429
 tx_bytes: 46178788616
 rx_broadcast: 3144
 tx_broadcast: 2068
 rx_multicast: 79
 tx_multicast: 0
 rx_errors: 0
 tx_errors: 0
 tx_dropped: 0
 multicast: 79
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_no_buffer_count: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_abort_late_coll: 0
 tx_deferred_ok: 36256
 tx_single_coll_ok: 0
 tx_multi_coll_ok: 0
 tx_timeout_count: 0
 tx_restart_queue: 0
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 rx_align_errors: 0
 tx_tcp_seg_good: 0
 tx_tcp_seg_failed: 0
 rx_flow_control_xon: 37420
 rx_flow_control_xoff: 37420
 tx_flow_control_xon: 0
 tx_flow_control_xoff: 0
 rx_long_byte_count: 28384277429
 rx_csum_offload_good: 33478553
 rx_csum_offload_errors: 0
 rx_header_split: 0
 alloc_rx_buff_failed: 0
 tx_smbus: 0
 rx_smbus: 0
 dropped_smbus: 0
There have been no further deferred frames, in accordance with what was
posted above. However, as Urs very correctly points out, the problem
is still very much there. I am waiting for new and improved cables
(third batch) to see if that makes a difference.
I am also wondering if, perhaps, a layer3 managed switch with port
statistics may prove helpful.
To answer Bill's questions, memtest+ 1.70 was run overnight, it
totalled some fifty passes with no errors. I know that some maintain
that memtest is not a definitive RAM test, but it certainly catches
most problems.
As to the disk side of the equation, that is of course a possible
concern and I will be running a second, more intensive batch of tests.
However, the fact that virtualbox (running on the server) reports no
errors despite running through significant amounts of data (10GB+,
passing through samba via the tap device that is bridged to the
problematic ethernet) every day makes me fairly confident that the
disk side of the equation is not where the problem lies.

LF


Re: e1000 driver and samba

2007-09-18 Thread Bill Fink
On Tue, 18 Sep 2007, Florian Weimer wrote:

> * Urs Thuermann:
> 
> > How can a corrupted frame pass the TCP checksum check?
> 
> The TCP/IP checksums are extremely weak.  If the corruption is due to
> defective SRAM or something like that, it's likely that it causes an
> error pattern which is 16-bit-aligned.  And an even number of
> 16-bit-aligned bit flips is not detected by the TCP checksum. 8-(
> 
> Actually, nobody should use TCP without application-level checksums
> for that reason.  But of course, there is HTTP.

But in this specific case, IIRC there were _no_ receive checksum
errors seen, and it would seem odd that any bit corruption was
_always_ an even number of 16-bit-aligned bit flips.

Also, I don't know anything at all about the SAMBA fs/protocol, but
I would expect it would have some kind of stronger data integrity
capability that should catch such errors.  Which would be another
reason implying the data corruption problem is above the network
layer, and perhaps a hardware error of some kind on the write path
to the disk (also could possibly be a software bug of some kind
in that path).

-Bill


RE: e1000 driver and samba

2007-09-18 Thread Tantilov, Emil S
I ran a test here in the lab and was not able to reproduce the issue
you're describing. My setup is with 2 systems connected through a
100Mbit switch. Client runs Windows XP and the server is Linux RHEL5
with updated samba 3.0.26a (latest I could download from samba.org).

I was able to copy a 1GB zip file back and forth and also decompress it
from/to the shared drive without data corruption. 

Have you looked into the data corruption - what is the difference
between the files? If the files have the same size you should be able to
see the diff with hexdump for example. 
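
For files of several hundred MB, a hexdump diff can be unwieldy. As a rough alternative, the sketch below lists where two same-sized files differ and groups the differing bytes into runs, which helps tell isolated bit flips from whole damaged blocks. The function names and the demo data are made up for illustration; real use would pass the paths of the original file and the copy fetched through the share.

```python
import os
import tempfile

def diff_offsets(path_a, path_b, chunk_size=1 << 20, limit=1000):
    """Return up to `limit` byte offsets at which two same-sized files differ."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        raise ValueError("files differ in size; compare sizes first")
    offsets, base = [], 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while len(offsets) < limit:
            a, b = fa.read(chunk_size), fb.read(chunk_size)
            if not a:
                break
            if a != b:  # only walk the chunk byte by byte when it differs
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        offsets.append(base + i)
            base += len(a)
    return offsets[:limit]

def group_runs(offsets):
    """Collapse sorted offsets into (start, length) runs of adjacent bytes."""
    runs = []
    for off in offsets:
        if runs and off == runs[-1][0] + runs[-1][1]:
            runs[-1][1] += 1
        else:
            runs.append([off, 1])
    return [tuple(r) for r in runs]

# Demo with synthetic data: one 4-byte damaged run and one single-bit flip.
good = bytes(range(256)) * 64
bad = bytearray(good)
bad[4096:4100] = b"\xff" * 4
bad[9000] ^= 0x01

with tempfile.TemporaryDirectory() as tmp:
    pa, pb = os.path.join(tmp, "good.bin"), os.path.join(tmp, "bad.bin")
    with open(pa, "wb") as f:
        f.write(good)
    with open(pb, "wb") as f:
        f.write(bytes(bad))
    runs = group_runs(diff_offsets(pa, pb))

print(runs)
```

Runs that start on 4096-byte or 512-byte boundaries would hint at the disk or page-cache path, while scattered single-byte differences would look more like memory or bus trouble; that interpretation is a rule of thumb, not a diagnosis.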

Could you provide your samba config?

Thanks,
Emil

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of L F
Sent: Tuesday, September 18, 2007 9:32 AM
To: Florian Weimer
Cc: Urs Thuermann; Bill Fink; Brandeburg, Jesse; Kok, Auke-jan H; James
Chapman; netdev@vger.kernel.org
Subject: Re: e1000 driver and samba

This is the latest ethtool -S :
beehive:~# ethtool -S eth4
NIC statistics:
 rx_packets: 33491526
 tx_packets: 41410384
 rx_bytes: 28384277429
 tx_bytes: 46178788616
 rx_broadcast: 3144
 tx_broadcast: 2068
 rx_multicast: 79
 tx_multicast: 0
 rx_errors: 0
 tx_errors: 0
 tx_dropped: 0
 multicast: 79
 collisions: 0
 rx_length_errors: 0
 rx_over_errors: 0
 rx_crc_errors: 0
 rx_frame_errors: 0
 rx_no_buffer_count: 0
 rx_missed_errors: 0
 tx_aborted_errors: 0
 tx_carrier_errors: 0
 tx_fifo_errors: 0
 tx_heartbeat_errors: 0
 tx_window_errors: 0
 tx_abort_late_coll: 0
 tx_deferred_ok: 36256
 tx_single_coll_ok: 0
 tx_multi_coll_ok: 0
 tx_timeout_count: 0
 tx_restart_queue: 0
 rx_long_length_errors: 0
 rx_short_length_errors: 0
 rx_align_errors: 0
 tx_tcp_seg_good: 0
 tx_tcp_seg_failed: 0
 rx_flow_control_xon: 37420
 rx_flow_control_xoff: 37420
 tx_flow_control_xon: 0
 tx_flow_control_xoff: 0
 rx_long_byte_count: 28384277429
 rx_csum_offload_good: 33478553
 rx_csum_offload_errors: 0
 rx_header_split: 0
 alloc_rx_buff_failed: 0
 tx_smbus: 0
 rx_smbus: 0
 dropped_smbus: 0
There have been no further deferred frames, in accordance to what was
posted above. However, as Urs very correctly points out, the problem
is still very much there. I am waiting for new and improved cables
(third batch) to see if that makes a difference.
I am also wondering if, perhaps, a layer3 managed switch with port
statistics may prove helpful.
To answer Bill's questions, memtest+ 1.70 was run overnight, it
totalled some fifty passes with no errors. I know that some maintain
that memtest is not a definitive RAM test, but it certainly catches
most problems.
As to the disk side of the equation, that is of course a possible
concern and I will be running a second, more intensive batch of tests.
However, the fact that virtualbox (running on the server) reports no
errors despite running through significant amounts of data (10GB+,
passing through samba via the tap device that is bridged to the
problematic ethernet) every day makes me fairly confident that the
disk side of the equation is not where the problem lies.

LF


Re: e1000 driver and samba

2007-09-19 Thread L F
Well,
the issue seems to have gone away as of this morning, but I am
somewhat unsure as to why.
Placement of some things was modified so as to allow shorter cables.
Now there are 3' CAT6 cables everywhere except for the 15' cable
between the two switches. All the cables are new, high quality
'tested' cables from a company nearby.
The server is now running 2.6.22.6 with the 7.6.5 e1000 driver from
intel.com and samba 3.0.26-1 ... and it seems to work. Samba will not
disconnect, even with all 8 clients running unreasonable read/write
loads and CRC and MD5 checksums of the transferred files all match.
The issue therefore seems to have gone away, but the reason why still
escapes me. I cannot believe that CAT5 cables under 10' in length were
causing it, because if that were the case
1) it would've shown itself, I presume, from the beginning
2) I could name dozens of different locations which would be having
the same problems
Samba 3.0.25 was definitely part of the problem and I sent a nice
nastygram to the debian maintainers, because -testing is not
-unstable, last I checked.
As to samba having any sort of data integrity capability, to the best
of my knowledge that has never been the case.
To answer further questions: I checked for file integrity with
CRC/CRC32/MD5 checksum utilities. They used to fail fairly
consistently, they have been fine all this morning.
Here is my samba config, for reference, comments etc. stripped.

[global]
   workgroup = WORKGROUP
   server string = %h server
   wins support = yes
   dns proxy = yes
   name resolve order = host wins bcast
   log level = 1
   max log size = 1000
   syslog = 0
   panic action = /usr/share/samba/panic-action %d
   encrypt passwords = true
   passdb backend = tdbsam
   obey pam restrictions = yes
   invalid users = root
   passwd program = /usr/bin/passwd %u
   passwd chat = *Enter\snew\sUNIX\spassword:* %n\n
*Retype\snew\sUNIX\spassword:* %n\n .
   socket options = TCP_NODELAY IPTOS_LOWDELAY
   domain master = yes
[backups]
   comment = Backup Share
   path = /var/archive/backups
   browseable = yes
   writeable = no
   guest ok = no
   write list = samba
   force user = samba
[downloads]
   comment = Downloads Share
   path = /var/archive/downloads
   browseable = yes
   writeable = no
   guest ok = no
   read list = samba
   write list = samba
   force user = samba

There is nothing there that I would deem unusual. Since the transition
to 2.6 kernels I have been omitting the buffer statements in the
socket options.
I have one further question: what should I be doing with the TSO and
flow control? As of now, TSO is on but flow control is off.
I'd like to thank everyone who helped and I'll be trying to see if the
realtek integrated NIC works next.

Luigi Fabio


Re: e1000 driver and samba

2007-09-19 Thread Bill Fink
On Wed, 19 Sep 2007, L F wrote:

> I have one further question: what should I be doing with the TSO and
> flow control? As of now, TSO is on but flow control is off.
> I'd like to thank everyone who helped and I'll be trying to see if the
> realtek integrated NIC works next.

Just my personal opinion, but unless you want to do more testing,
since you now seem to have a working setup, I would tend to leave
it the way it is.

-Bill


Re: e1000 driver and samba

2007-09-19 Thread Bill Fink
On Wed, 19 Sep 2007, L F wrote:

> Well,
> the issue seems to have gone away as of this morning, but I am
> somewhat unsure as to why.
> Placement of some things was modified so as to allow shorter cables.
> Now there are 3' CAT6 cables everywhere except for the 15' cable
> between the two switches. All the cables are new, high quality
> 'tested' cables from a company nearby.
> The server is now running 2.6.22.6 with the 7.6.5 e1000 driver from
> intel.com and samba 3.0.26-1 ... and it seems to work. Samba will not
> disconnect, even with all 8 clients running unreasonable read/write
> loads and CRC and MD5 checksums of the transferred files all match.
> The issue therefore seems to have gone away, but the reason why still
> escapes me. I cannot believe that CAT5 cables under 10' in length were
> causing it, because if that were the case
> 1) it would've shown itself, I presume, from the beginning
> 2) I could name dozens of different locations which would be having
> the same problems
> Samba 3.0.25 was definitely part of the problem and I sent a nice
> nastygram to the debian maintainers, because -testing is not
> -unstable, last I checked.
> As to samba having any sort of data integrity capability, to the best
> of my knowledge that has never been the case.
> To answer further questions: I checked for file integrity with
> CRC/CRC32/MD5 checksum utilities. They used to fail fairly
> consistently, they have been fine all this morning.

By any chance did you happen to power cycle some equipment in this
process that you didn't previously power cycle during earlier testing
and debugging?  If so, perhaps that hardware had somehow gotten into
a funky state, and the power cycling might have cleared it up.

Just a thought.

-Bill


Re: e1000 driver and samba

2007-09-20 Thread Bruce Cole

> If you look for it on the Realtek cards, there had been sporadic
> issues up to late 2005. The solution posted universally was 'change
> card'.


Yes, that *was* the common recommendation.  But recently I narrowed down the
realtek performance problem most commonly seen with samba (but also applicable
to other TCP applications), and I also narrowed down the fix as well.

The current fix involves re-kicking the TX queue after it becomes stuck.
Apparently it becomes stuck due to a contention problem between the driver and
controller.  I suspect the root problem is the driver isn't properly locking
the TX queue.  It might be worth checking if the queue locking problem exists
in other net drivers as well. 


Reference:
http://www.spinics.net/lists/netdev/msg40384.html




Re: e1000 driver and samba

2007-09-21 Thread L F
On 9/20/07, Bruce Cole <[EMAIL PROTECTED]> wrote:
> Yes, that *was* the common recommendation.  But recently I narrowed down the
> realtek performance problem most commonly seen with samba (but also applicable
> to other TCP applications), and I also narrowed down the fix as well.
>
> The current fix involves re-kicking the TX queue after it becomes stuck.
> Apparently it becomes stuck due to a contention problem between the driver and
> controller.  I suspect the root problem is the driver isn't properly locking
> the TX queue.  It might be worth checking if the queue locking problem exists
> in other net drivers as well.
Aha. This doesn't seem to be in Mr. Romieu's patch above: should it go
in on top of that?
I ask because with the aforementioned patch the newer integrated NICs
seem to be recognised correctly and preliminary testing shows no
disconnect issues, but performance is nothing to write home about (one
of these days I'll get into a rant about samba speed vs. ftp speed,
but this is not the time nor place).

> Reference:
> http://www.spinics.net/lists/netdev/msg40384.html

> Bruce Cole

LF


Re: e1000 driver and samba

2007-09-21 Thread L F
On 9/19/07, Bill Fink <[EMAIL PROTECTED]> wrote:
> Just my personal opinion, but unless you want to do more testing,
> since you now seem to have a working setup, I would tend to leave
> it the way it is.
Quite sensible, yes. Performance even seems to be good - I am getting
40-40MBps reads and 24-26MBps writes - so it'll stay the way it is.

> By any chance did you happen to power cycle some equipment in this
> process that you didn't previously power cycle during earlier testing
> and debugging?  If so, perhaps that hardware had somehow gotten into
> a funky state, and the power cycling might have cleared it up.
Not that I am aware of: one of the first things that I did - and
repeated basically every step of the way - was to powercycle the two
switches, following the same line of reasoning you did. The clients
were turned off every night and turned back on every morning and the
WAN Comcast CPE wasn't touched for the duration. The only thing that
did change is that in an impetus of efficiency or perhaps desperation
I changed that cable too (to CAT6, 3' long), but I can't imagine that
would affect the LAN side of operations.
Thanks again - to everyone - for the help. I am still puzzled, but at
least I am puzzled with a consistent situation.
To Mr. Romieu: the patch you provided seems to work, in that 'regular'
loads don't trip samba up. I have to check the CRCs, though.

> -Bill

Luigi Fabio


Re: e1000 driver and samba

2007-09-21 Thread Bruce Cole

L F wrote:

> Aha. This doesn't seem to be in Mr. Romieu's patch above: should it go
> in on top of that?

His newer 0002-r8169-workaround-against-ignored-TxPoll-writes-8168.patch
does the same thing as the older quoted version, and is also
included in the roll-up patch he pointed you to.

I agree re. performance, and don't want to clutter the thread with
realtek issues.  I just wanted to clarify this particular
realtek-samba issue in case it's relevant.  For a long time folks
suspected similar culprits (bad cable, half duplex, etc.) when
it was the driver.  Perhaps you should try doing a ping flood
between server and client during your test.  That healed the
problem by keeping the queue busy in the realtek case.



Re: e1000 driver and samba

2007-09-21 Thread Francois Romieu
Bruce Cole <[EMAIL PROTECTED]> :
> >If you look for it on the Realtek cards, there had been sporadic
> >issues up to late 2005. The solution posted universally was 'change
> >card'.
> 
> Yes, that *was* the common recommendation.

There was no such thing as a universal solution to sporadic issues.

[...]
> I suspect the root problem is the driver isn't properly locking
> the TX queue.

Can you be more specific ?

-- 
Ueimor


Re: e1000 driver and samba

2007-09-21 Thread Bruce Cole

Francois Romieu wrote:

> Bruce Cole <[EMAIL PROTECTED]> :
>>> If you look for it on the Realtek cards, there had been sporadic
>>> issues up to late 2005. The solution posted universally was 'change
>>> card'.
>>
>> Yes, that *was* the common recommendation.
>
> There was no such thing as a universal solution to sporadic issues.

I made no such claim.  I do claim the realtek samba et al. issues are
not sporadic, however.  In fact the common problem is readily
reproducible, as has been shown.

> [...]
>> I suspect the root problem is the driver isn't properly locking
>> the TX queue.
>
> Can you be more specific ?

Yes per the reference I gave:
http://www.spinics.net/lists/netdev/msg40384.html
"Now since this change heals the TX queue stall, it would seem that the real
underlying problem involves a race condition with enqueueing to the TX queue
while the controller is processing the queue. The ultimate fix for that
I bet is either to address locking at TX enqueue time, or there is a
controller bug. Any clarification from realtek on the necessary
processing for the NPQ bit, or a known controller problem?

PS: I've also received private email that this problem pertains to video
streaming (to a Kiss DVD player) not just samba or X11 traffic.  Basically
most all high-level TCP based protocols are affected it seems.  This serious
performance problem should be considered to impact a lot more than just
samba users."

I could probably help fix the underlying problem but I didn't
receive any response to my post quoted above.



Re: e1000 driver and samba

2007-09-21 Thread Francois Romieu
Bruce Cole <[EMAIL PROTECTED]> :
> Francois Romieu wrote:
[...]
> >Can you be more specific ?
> >  
> Yes per the reference I gave:
> http://www.spinics.net/lists/netdev/msg40384.html
[...]

Ok, I wondered if you had found something between the start_xmit and the
Tx completion code.

[...]
> I could probably help fix the underlying problem but I didn't
> receive any response to my post quoted above.

I have submitted the smallest workaround to Jeff. It is not necessarily
the best wrt performance but this part is not trivial to arbitrate.

-- 
Ueimor
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html