Re: [Telrad] Uplink throughput again

Jeremy Austin Thu, 16 Mar 2017 07:45:07 -0700

Sorry, early morning WISPAMERICA brain.

What I meant to ask, Nathan, was how you captured both sides of PDN/EPC
traffic isolated from the CPE7000? The ingress I can understand, but the
egress would be S1 encapsulated, no?


This could make a great forum post -- I can imagine needing it myself
someday.

Thanks again.

On Thu, Mar 16, 2017 at 9:01 AM, Jeremy Austin <jhaus...@gmail.com> wrote:

> Nathan, thanks for the clarification.
>
> On Thu, Mar 16, 2017 at 8:54 AM Nathan Anderson <nath...@fsr.com> wrote:
>
>> Just an update to this: at the direction of Telrad support, I ran 2
>> simultaneous packet captures during a download where corruption occurred:
>> one right at the point of ingress at the EPC, and one right at the point of
>> egress.
>>
>>
>>
>> It turns out that I was WRONG about part of this.  The EPC is definitely
>> corrupting traffic in the newer firmware builds we have been given, as the
>> captures demonstrate, but it is NOT also regenerating the TCP payload
>> checksums on every packet that flows through it, thank goodness.  No, it
>> turns out that the reason these corrupt payloads are making it all the way
>> to the user is that the CPE7000's NAT engine is completely recomputing the
>> checksums, instead of properly adjusting them to reflect only the changes
>> that it makes to the headers (see section 3.3 of
>> https://www.ietf.org/rfc/rfc1631.txt).  So this is a two-parter: the EPC
>> is corrupting bits, and the CPE7000 is responsible for covering up the
>> corruption.
>>
>>
>>
>> I tested with a CPE8000, and its NAT engine is doing the right thing.
>> Thus, the corrupt packets make it to the client, which sees the invalid
>> checksum and tosses the packet, triggering a retransmit.
>>
>>
>>
>> The EPC firmware we have been using is a development build, and the
>> corruption bug appears to be unique to that.  But the CPE7000 firmware we
>> used for testing was the latest public release (116).
>>
>>
>>
>> -- Nathan
>>
>>
>>
>> *From:* telrad-boun...@wispa.org [mailto:telrad-boun...@wispa.org] *On
>> Behalf Of *Nathan Anderson
>> *Sent:* Wednesday, March 15, 2017 1:47 AM
>> *To:* telrad@wispa.org
>>
>>
>> *Subject:* Re: [Telrad] Uplink throughput again
>>
>>
>>
>> This is exactly it.  We didn't have the visibility into things to see
>> what was causing the poor throughput at first (yet another one of our
>> longstanding frustrations with the platform), but this is the problem that
>> Jeremy and I were referring to.
>>
>>
>>
>> I'm glad to say that we have not (knowingly) experienced the CPU usage
>> fluctuations on our EPCs.
>>
>>
>>
>> As for the data corruption issue, you likely will not have run up
>> against it unless you are running a preproduction release of 6.7.  The
>> symptoms are that we will see clusters of 4 consecutive bytes that have
>> various bits flipped (usually what happens is that bytes 1 and 2 are zeroed
>> out, and bytes 3 and 4 are completely different from what they would
>> normally be, but the pattern of what exactly is changed is not clear to us
>> yet).  We see on average between 12 and 60 corrupted bytes per 100MB
>> transferred per user in this state.  The VERY BAD and VERY SCARY part is
>> that if you do a packet capture, you will see that exactly zero TCP
>> packets have a checksum that does not validate.  So it's not as if most of
>> the corrupted packets are being thrown out because the checksum doesn't
>> match, with only a handful slipping through.  No, every
>> single packet has a valid checksum, even the ones with corrupt data in
>> them.  What this means is that 1) HTTPS transfers just stop and die when
>> the corruption occurs, and 2) HTTP/FTP/other unencrypted transfers
>> introduce silent data corruption into the download that you won't discover
>> until it is too late.
>>
>>
>>
>> That all packets have a checksum that validates would seem to suggest
>> that the EPC is ingesting TCP packets from the PDN interface, throwing out
>> the original TCP checksum (as a shortcut, or...? what valid reasons would
>> you possibly have for doing this?), doing something internally that causes
>> random corruption, and then recomputing a new checksum from scratch before
>> sending it on to the target user over S1-U.  That a bug like this is even
>> *possible* BLOWS MY MIND.  If you're going to ignore the original checksum
>> that the packet arrives with, what's the point of the checksum in the first
>> place?  How can I ever trust the data flowing through this device again
>> knowing that it is working around and subverting a key component that helps
>> to ensure and preserve data integrity?
>>
>>
>>
>> -- Nathan
>>
>>
>>
>> *From:* telrad-boun...@wispa.org [mailto:telrad-boun...@wispa.org
>> <telrad-boun...@wispa.org>] *On Behalf Of *Adam Moffett
>> *Sent:* Tuesday, March 14, 2017 8:34 PM
>> *To:* telrad@wispa.org; telrad@wispa.org
>> *Subject:* Re: [Telrad] Uplink throughput again
>>
>>
>>
>> * UE getting stuck at MCS4....apparently until an S1 reset.  This may or
>> may not be the same throughput issue that you guys were talking about
>> earlier in the thread.
>> _______________________________________________
>> Telrad mailing list
>> Telrad@wispa.org
>> http://lists.wispa.org/mailman/listinfo/telrad
>>
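
For anyone trying to locate the 4-byte clusters described in the quoted
message, here is a minimal sketch in Python. It assumes you have a
known-good copy of a file and the copy downloaded through the EPC, both
already read into memory; the function name diff_runs is hypothetical and
nothing here comes from Telrad tooling. Note that a run can come out
shorter than 4 bytes when a corrupted byte happens to equal the original
(for example, a byte that was already zero before being zeroed).

```python
def diff_runs(good: bytes, got: bytes) -> list[tuple[int, int]]:
    """Return (offset, length) for each run of consecutive differing bytes.

    Only the overlapping prefix of the two inputs is compared.
    """
    runs = []
    start = None
    for i, (a, b) in enumerate(zip(good, got)):
        if a != b:
            if start is None:
                start = i  # a new run of differences begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        # The inputs still differed at the end of the compared region.
        runs.append((start, min(len(good), len(got)) - start))
    return runs


# Made-up data for illustration: one cluster with bytes 1-2 zeroed and
# bytes 3-4 changed, mirroring the pattern described in the thread.
good = bytes(range(1, 65))
bad = bytearray(good)
bad[20:24] = b"\x00\x00\xaa\xbb"

for offset, length in diff_runs(good, bytes(bad)):
    print(f"offset {offset}: {length} differing bytes")
```

For a 100MB download you would read both files in chunks rather than all
at once, but the comparison itself stays the same.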
>
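
To illustrate the checksum point in Nathan's quoted message, here is a
rough sketch in Python of the difference between a NAT that adjusts a
checksum incrementally and one that recomputes it from scratch. The
assumptions are mine: the addresses and payload are made up, a 4-byte
address stands in for the real TCP pseudo-header and header fields, and the
incremental form used is the one I know from RFC 1624 (HC' = ~(~HC + ~m +
m')), not anything taken from the CPE7000 or CPE8000 firmware.

```python
def ones_sum(data: bytes) -> int:
    """16-bit one's-complement sum, as used by the Internet checksum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def checksum(data: bytes) -> int:
    """Internet checksum: complement of the one's-complement sum."""
    return ~ones_sum(data) & 0xFFFF


def incremental_update(csum: int, old_word: int, new_word: int) -> int:
    """Adjust csum for one changed 16-bit word: HC' = ~(~HC + ~m + m')."""
    total = (~csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def nat_incremental(csum: int, old_addr: bytes, new_addr: bytes) -> int:
    """Account only for the address bytes the NAT itself rewrote."""
    for i in range(0, len(old_addr), 2):
        old_word = (old_addr[i] << 8) | old_addr[i + 1]
        new_word = (new_addr[i] << 8) | new_addr[i + 1]
        csum = incremental_update(csum, old_word, new_word)
    return csum


# Hypothetical addresses and payload, for illustration only.
old_addr = bytes([10, 0, 0, 5])
new_addr = bytes([203, 0, 113, 7])
payload = b"The quick brown fox jumps over the lazy dog."
sender_csum = checksum(old_addr + payload)

# Simulate upstream corruption: two bytes zeroed, two bytes changed.
corrupt = bytearray(payload)
corrupt[8:12] = b"\x00\x00\xde\xad"
corrupt = bytes(corrupt)

good_nat = nat_incremental(sender_csum, old_addr, new_addr)
bad_nat = checksum(new_addr + corrupt)  # recomputed from scratch

print(checksum(new_addr + corrupt) == good_nat)  # False: receiver drops it
print(checksum(new_addr + corrupt) == bad_nat)   # True: corruption is hidden
```

The incremental path never looks at the payload, so the sender's original
checksum keeps protecting it end to end. The recompute path signs off on
whatever bytes happen to be in the packet at that moment.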


-- 
Jeremy Austin

(907) 895-2311
(907) 803-5422
jhaus...@gmail.com

Heritage NetWorks
Whitestone Power & Communications
Vertical Broadband, LLC

Schedule a meeting: http://doodle.com/jermudgeon
