Reviews of draft-ietf-bmwg-sip-bench-term-08 and draft-ietf-bmwg-sip-bench-meth-08

Summary: These drafts are not ready for publication as RFCs.

First, some of the text in these documents shows signs of being old, and the
working group may have been staring at them so long that the problems have become hard to see. The terminology document says "The issue of overload in SIP networks is
currently a topic of discussion in the SIPPING WG." (SIPPING was closed in
2009). The methodology document suggests a "flooding" rate that is orders of
magnitude below what simple devices achieve at the moment. That these survived working group last call indicates a different type of WG review may be needed
to groom other bugs out of the documents.

Who is asking for these benchmarks, and are they (still) participating in the
group?  The measurements defined here are very simplistic and will provide
limited insight into the relative performance of two elements in a real
deployment. The documents should be clear about their limitations, and it would be good to know that the community asking for these benchmarks is getting tools that will actually be useful to them. The crux of these two documents is in the
last paragraph of the introduction to the methodology doc: "Finally, the
overall value of these tests is to serve as a comparison function between
multiple SIP implementations". The documents punt on providing any comparison
guidance, but even if we assume someone can figure that out, do these
benchmarks provide anything actually useful as inputs?

It would be good to explain how these documents relate to RFC6076.

The terminology tries to refine the definition of session, but the definition provided, "The combination of signaling and media messages and processes that support a SIP-based service", doesn't answer what's in one session vs another. Trying to generically define session has been hard, and several working groups
have struggled with it (see INSIPID for a current version of that
conversation). This document doesn't _need_ a generic definition of session - it only needs to define the set of messages that it is measuring. It would be much clearer to say "for the purposes of this document, as session is the set of SIP messages associated with an INVITE initiated dialog and any Associated
Media, or a series of related SIP MESSAGE requests". (And looking at the
benchmarks, you aren't leveraging related MESSAGE requests - they all appear to
be completely independent). Introducing the concepts of INVITE-initiated
sessions and non-INVITE-initiated sessions doesn't actually help define the
metrics. When you get to the metrics, you can speak concretely in terms of a
series of INVITEs, REGISTERs, and MESSAGEs. Doing that, and providing a short
introduction that shows folks with PSTN backgrounds how these relate to
"Session Attempts", will be clearer.

To be clear, I strongly suggest a fundamental restructuring of the document to describe the benchmarks in terms of dialogs and transactions, and remove the IS
and NS concepts completely.

The INVITE related tests assume no provisional responses, leaving out the
effect on a device's memory when the state machines it is maintaining
transition to the proceeding state. Further, by not including provisionals, and
building the tests to search for Timer B firing, the tests ensure there will be
multiple retransmissions of the INVITE (when using UDP) that the device being
tested has to handle (see the sketch below). The traffic an element has to
handle, and likely the memory it will consume, will be very different with even
a single 100 Trying, which is the more
usual case in deployed networks. The document should be clear _why_ it chose
the test model it did and left out metrics that took having a provisional
response into account. Similarly, you are leaving out the delayed-offer INVITE transactions used by 3pcc and it should be more obvious that you are doing so.
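
To make the retransmission point concrete, here is the schedule I have in mind.
This is a rough sketch using RFC 3261's default timers, not anything taken from
the drafts:

    # Rough sketch (my arithmetic, not the drafts'): how many times an EA
    # retransmits an INVITE over UDP when the DUT never sends a provisional,
    # per RFC 3261 Timer A (starts at T1, doubles) and Timer B (64*T1).
    T1 = 0.5                            # default T1 in seconds
    timer_b = 64 * T1                   # 32 s
    t, interval, sends = 0.0, T1, 1     # the initial INVITE goes out at t=0
    retransmits = []
    while t + interval < timer_b:
        t += interval
        retransmits.append(t)
        sends += 1
        interval *= 2
    print(sends, retransmits)
    # 7 transmissions in all: retransmits at 0.5, 1.5, 3.5, 7.5, 15.5, 31.5 s.
    # A single 100 Trying stops Timer A and removes all of that traffic.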

Likewise, the media oriented tests take a very basic approach to simulating
media. It should be explicitly stated that you are simulating the effects of a codec like G.711 and that you are assuming an element would only be forwarding
packets and has to do no transcoding work. It's not clear from the documents
whether the EA is generating actual media or dummy packets. If it's actual
media, the test parameters that assume constant sized packets at a constant
rate will not work well for video (and I suspect endpoints, like B2BUAs, will
terminate your call early if you send them garbage).
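
For what it's worth, here is the arithmetic I assume the constant-size,
constant-rate media model is standing in for. The 20 ms packetization time is
my assumption, not something the drafts state:

    # Sketch of the constant-size, constant-rate model I assume the drafts
    # intend, using G.711 with a 20 ms packetization time (my assumption).
    ptime_ms = 20                          # ms of audio per RTP packet
    payload = 8000 * ptime_ms // 1000      # 8000 samples/s, 1 byte each -> 160
    pps = 1000 // ptime_ms                 # 50 packets per second per stream
    on_wire = payload + 12 + 8 + 20        # + RTP, UDP, IPv4 headers -> 200
    print(payload, pps, on_wire)           # 160 50 200
    # A VBR video codec meets none of these constants, which is the problem.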

The sections on a series of INVITEs are fairly clear that you mean each of them
to have different dialog identifiers.  I don't see any discussion of varying
the To: URI. If you don't, what's going to keep a gateway or B2BUA from
rejecting all but the first with something like Busy? Similarly, I'm not
finding where you talk about how many AoRs you are registering against in the registration tests. I think, as written, someone could write this where all the
REGISTERs affected only one AoR.
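
A hypothetical sketch of the kind of variation I'm looking for the methodology
to require. The names and domain here are mine, not the drafts':

    # Hypothetical sketch: what I'd expect the EA to vary so that a downstream
    # UAS/B2BUA doesn't see N simultaneous calls (or N registrations) all aimed
    # at one user. None of these names come from the drafts.
    import uuid

    def next_invite_target(i, domain="example.com"):
        return {
            "Call-ID": uuid.uuid4().hex + "@" + domain,
            "From-tag": uuid.uuid4().hex[:10],
            "To": "sip:user%06d@%s" % (i, domain),  # a distinct AoR per attempt
        }

    for i in range(3):
        print(next_invite_target(i))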

The methodology document calls Stress Testing out of scope, but the very nature of the Benchmarking algorithm is a stress test. You are iteratively pushing to see at what point something fails, _exactly_ by finding the rate of attempted
sessions per second that the thing under test would consider too high.

Now to specific issues in document order, starting with the terminology
document (nits are separate and at the end):

* T (for Terminology document): The title and abstract are misleading - this is
not general benchmarking for SIP performance. You have a narrow set of
tests, gathering metrics on a small subset of the protocol machinery.
Please (as RFC 6076 did) look for a title that matches the scope of the
document. For instance, someone testing a SIP Events server would be ill-served
with the benchmarks defined here.

* T, section 1: RFC5393 should be a normative reference. You probably also need to pull in RFCs 4320 and 6026 in general - they affect the state machines you
are measuring.

* T, 3.1.1: As noted above, this definition of session is not useful. It
doesn't provide any distinction between two different sessions. I strongly
disagree that SIP reserves "session" to describe services analogous to
telephone calls on a switched network - please provide a reference. SIP INVITE
transactions can pend forever - it is only the limited subset of the use of
the transactions (where you don't use a provisional response) that keeps this communication "brief". In the normal case, an INVITE and its final response can
be separated by an arbitrary amount of time. Instead of trying to tweak this
text, I suggest replacing all of it with simpler, more direct descriptions of
the sequence of messages you are using for the benchmarks you are defining
here.

*T, 3.1.1: How is this vector notion (and graph) useful for this document? I
don't see that it's actually used anywhere in the documents. Similarly, the
arrays don't appear to be actually used (though you reference them from some
definitions) - What would be lost from the document if you simply removed all
this text?

*T, 3.1.5, Discussion, last sentence: Why is it important to say "For UA-type of network devices such as gateways, it is expected that the UA will be driven into overload based on the volume of media streams it is processing"? It's not
clear that's true for all such devices. How is saying anything here useful?

*T, 3.1.6: This definition says an outstanding BYE or CANCEL is a Session
Attempt. Why not just say INVITE? You aren't actually measuring "session
attempts" for INVITEs or REGISTERs - you have separate benchmarks for them.

*T, 3.1.7: It needs to be explicit that these benchmarks are not accounting
for/allowing early dialogs.

*T, 3.1.8: The words "early media" appear here for the first time. Given the
way the benchmarks are defined, does it make sense to discuss early media in
these documents at all (beyond noting you do not account for it)? If so,
there needs to be much more clarity. (By the way, this Discussion will be
much easier to write in terms of dialogs).

*T, 3.1.9, Discussion point 2: What does "the media session is established"
mean? If you leave this written as a generic definition, then is this when an
MSRP connection has been made? If you simplify it to the simple media model
currently in the document, does it mean an RTP packet has been sent? Or does it
have to be received? For the purposes of the benchmarks defined here, it
doesn't seem to matter, so why have this as part of the discussion anyway?

*T, 3.1.9, Definition: A series of CANCELs meets this definition.

*T, 3.1.10 Discussion: This doesn't talk about 3xx responses, and they aren't
covered elsewhere in the document.

*T, 3.1.11 Discussion: Isn't the MUST in this section methodology? Why is it in
this document and not -meth-?

*T, 3.1.11 Discussion, next to last sentence: "measured by the number of
distinct Call-IDs" means you are not supporting forking, or you would not count
answers from more than one leg of the fork as different sessions, like you
should. Or are you intending that there would never be an answer from more than
one leg of a fork? If so, the documents need to be clearer about the
methodology and what's actually being measured.

*T, 3.2.2 Definition: There's something wrong with this definition. For
example, proxies do not create sessions (or dialogs). Did you mean "forwards
messages between"?

*T, 3.2.2 Discussion: This is a definition by enumeration, since it uses a MUST, and it excludes any future things that might sit in the middle. If that's what you want, make this the definition. The MAY seems contradictory unless you
are saying a B2BUA or SBC is just a specialized User Agent Server. If so,
please say it that way.

*T, 3.2.3: This seems out of place or under-explored.  You don't appear to
actually _use_ this definition in the documents. You declare these things in
scope, but the only consequence is the line in this section about not lowering
performance benchmarks when present. Consider making that part of the
methodology of a benchmark and removing this section. If you think it's
essential, please revisit the definition - you may want to generalize it into
_anything_ that sits on the path and may affect SIP processing times
(otherwise, what's special about this either being SIP Aware, or being a
Firewall)?

*T, 3.2.5 Definition: This definition just obfuscates things. Point to 3261's definition instead. How is TCP a measurement unit? Does the general terminology
template include "enumeration" as a type? Do you really want to limit this
enumeration to the set of currently defined transports? Will you never run
these benchmarks for SIP over websockets?

*T, 3.3.2 Discussion: Again, there needs to be clarity about what it means to
"create" a media session. This description differentiates attempt vs success,
so what is it exactly that makes a media session attempt successful? When you
say number of media sessions, do you mean the number of m lines or the total
number of
INVITEs that have SDP with m lines?

*T, 3.3.3: This would be much clearer written in terms of transactions and dialogs
(you are already diving into transaction state machine details). This is a
place where the document needs to point out that it is not providing benchmarks
relevant to environments where provisionals are allowed to happen and INVITE
transactions are allowed to pend.

*T, 3.3.4: How does this model (a single session duration separate from the
media session hold time) produce useful benchmarks? Are you using it to allow media to go beyond the termination of a call? If not, then you have media only
for the first part of a call? What real world thing does this reflect?
Alternatively, what part of the device or system being benchmarked does this
provide insight to?

*T, 3.3.5: The document needs to be honest about the limits of this simple
model of media. It doesn't account for codecs that do not have constant packet sizes. The benchmarks that use the model don't capture the differences based on
content of the media being sent - a B2BUA or gateway may behave
differently if it is transcoding or doing content processing (such as DTMF
detection) than it will if it is just shoveling packets without looking at
them.

*T, 3.3.6: Again, the model here is that any two media packets present the same
load to the thing under test. That's not true for transcoding, mixing, or
analysis (such as DTMF detection). It's not clear whether, if you have two
streams, each stream has its own "constant rate". You call out having one audio
and one video stream - how do you configure different rates for them?

*T, 3.3.7: This document points to the methodology document for indicating
whether streams are bi-directional or uni-directional. I can't find where the
methodology document talks about this (the string 'direction' does not
occur in that document).

*T, 3.3.8: This text is old - it was probably written pre-RFC5393. If you fork,
loop detection is not optional. This, and the methodology document, should be
updated to take that into account.

*T, 3.3.9: Clarify if more than one leg of a fork can be answered successfully
and update 3.1.11 accordingly. Talk about how this affects the success
benchmarks (how will the other legs getting failure responses affect the
scores?)

*T, 3.3.9, Measurement units: There is confusion here. The unit is probably
"endpoints". This section talks about two things, that, and type of forking.
How is "type of forking" a unit, and are these templates supposed to allow more
than one unit for a term?

*T, 3.4.2, Definition: It's not clear what "successfully completed" means. Did you mean "successfully established"? This is a place where speaking in terms of
dialogs and transactions rather than sessions will be much clearer.

*T, 3.4.3, This benchmark metric is underdefined. I'll focus on that in the
context of the methodology document (where the docs come closer to defining it). This definition includes a variable T but doesn't explain it - you have to read
the methodology to know what T is all about. You might just say "for the
duration of the test" or whatever is actually correct.

*T, 3.4.3, Discussion: "Media Session Hold Time MUST be set to infinity". Why?
The argument you give in the next sentence just says the media session hold
time has to be at least as long as the session duration. If they were equal,
and finite, the test result does not change. What's the utility of the infinity
concept here?

*T, 3.4.4: "until it stops responding". Any non-200 response is still a
response, and if something sends a 503 or a 4xx with a Retry-After (which is
likely when it's truly saturating) you've hit the condition you are trying to
find. The notion that the Overload Capacity is measurable by not getting any
responses at all is questionable. This discussion has a lot of methodology in
it - why isn't that (only) in the methodology document?

*T, 3.4.5: A normal, fully correct system that challenged requests and
performed flawlessly would have a 0.5 Session Establishment Performance score
(see the arithmetic sketch below). Is that what you intended? The SHOULD in
this section looks like methodology. Why is this a SHOULD and not a MUST? (The
document should be clearer about why sessions remaining established is
important.) Or wait - is this what Note 2 in section 5.1 of the methodology
document (which talks about reporting formats) is supposed to change? If so,
that needs to be moved to the actual methodology and made _much_ clearer.
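
Here is the arithmetic I'm doing, under my own reading of the definition:

    # The arithmetic behind the 0.5 observation, assuming the score is
    # (sessions established) / (session attempts) and that a digest-challenged
    # INVITE counts as a separate, failed attempt. Both assumptions are mine;
    # the documents should say which counting rule actually applies.
    challenged_attempts = 1000    # every initial INVITE draws a 401/407
    retried_attempts = 1000       # every retry with credentials succeeds
    established = 1000
    print(established / (challenged_attempts + retried_attempts))   # 0.5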

*T, 3.4.6: You talk of the first non-INVITE in an NS. How are you
distinguishing subsequent non-INVITEs in this NS from requests in some other
NS? Are you using dialog identifiers or something else? Why do you expect that to matter (why is the notion of a sequence of related non-INVITEs useful from a benchmarking perspective - there isn't state kept in intermediaries because of
them - what will make this metric distinguishable from a metric that just
focuses on the transactions?)

*T, 3.4.7: What's special about MESSAGE? Why aren't you focusing on INFO or
some other end-to-end non-INVITE? I suspect it's because you are wanting to
focus on a simple non-INVITE transaction (which is why you are leaving out
SUBSCRIBE/NOTIFY). MESSAGE is good enough for that, but you should be clear
that's why you chose it. You should also talk about whether the payloads of
all of the MESSAGE requests are the same size and whether that size is a parameter
to the benchmark. (You'll likely get very different behavior from a MESSAGE
that fragments.)
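
To show why the payload size matters beyond raw bytes, here is a sketch using
RFC 3261's size rule of thumb; the header size is my guess:

    # Sketch: whether a MESSAGE even stays on UDP depends on its size.
    # RFC 3261 section 18.1.1: a request within 200 bytes of the path MTU, or
    # larger than 1300 bytes when the MTU is unknown, has to move to a
    # congestion-controlled transport such as TCP. The header size is a guess.
    header_bytes = 600                    # assumed size of the MESSAGE headers
    for body in (100, 500, 800, 1200):
        total = header_bytes + body
        verdict = "fits on UDP" if total <= 1300 else "must switch to TCP"
        print(body, total, verdict)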

*T, 3.4.7: The definition says "messages completed" but the discussion talks
about "definition of success". Does success mean an IM transaction completed
successfully?  If so, the definition of success for a UAC has a problem. As
written, it describes a binary outcome for the whole test, not how to determine
the success of an individual transaction - how do you get from what it
describes to a rate?

*T, Appendix A: The document should better motivate why this is here.
Why does it mention SUBSCRIBE/NOTIFY when the rest of the document(s) are
silent on them?  The discussion says you are _selecting_ a Session Attempts
Arrival Rate distribution. It would be clearer to say you are selecting the
distribution of messages sent from the EA. It's not clear how this particular
metric will benefit from different sending distributions.

Now the Methodology document (comments prefixed with an M):

*M, Introduction: Can the document say why the subset of functionality
benchmarked here was chosen over other subsets? Why was SUBSCRIBE/NOTIFY or
INFO not included (or INVITEs with MSRP, or even simple early media, etc.)?

*M, Introduction paragraph 4: This points to section 4 and section 2 of the
terminology document for configuration options. Section 4 is the IANA
considerations section (which has no options). What did you mean to point to?

*M, Introduction paragraph 4, last sentence: This seems out of place - why is
it in the introduction and not in a section on that specific methodology?

*M, 4.1: It's not clear here, or in the methodology sections, whether the tests
allow the transport to change as you go across an intermediary. Do you intend
to be able to benchmark a proxy that has TCP on one side and UDP on the other?

*M, 4.2: This is another spot where pointing to the Updates to 3261 that change
the transaction state machines is important.

*M, 4.4: Did you really mean RTSP? Maybe you meant MSRP or something else? RTSP
is not, itself, a media protocol.

*M, 4.9: There's something wrong with this sentence: "This test is run for an
extended period of time, which is referred to as infinity, and which is,
itself, a parameter of the test labeled T in the pseudo-code". What value is
there in giving some finite parameter T the name "infinity"?

*M, 4.9: Where did 100 (as an initial value for s) come from? Modern devices
process at many orders of magnitude higher rates than that. Do you want to
provide guidance instead of an absolute number here?

*M 4.9: In the pseudo-code, you often say "the largest value". It would help to
say the largest value of _what_.

*M 4.9: What is the "steady_state" function that's called in the pseudo-code?
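
For comparison, here is the shape of the search I _think_ section 4.9 intends.
This is my sketch, not the draft's algorithm; send_traffic and all_succeeded
are placeholders, and I'm reading "largest value" as the largest attempt rate
at which every attempted session still succeeded:

    # My reading of what the section 4.9 loop is trying to do (a sketch under
    # my own assumptions, not the draft's pseudo-code).
    def find_session_rate(send_traffic, all_succeeded, start=100, duration=60):
        lo, hi = 0, start
        # Phase 1: keep doubling the rate until attempts start failing.
        while True:
            send_traffic(rate=hi, duration=duration)
            if all_succeeded():
                lo, hi = hi, hi * 2
            else:
                break
        # Phase 2: binary search between the last good and first bad rate.
        while hi - lo > 1:
            mid = (lo + hi) // 2
            send_traffic(rate=mid, duration=duration)
            if all_succeeded():
                lo = mid
            else:
                hi = mid
        return lo    # the largest rate at which the DUT kept up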

*M 6.3: Expected Results: The EAs will have different performance
characteristics depending on whether they are sending media. That could cause this
metric to be different from session establishment without media.

*M 6.5: This section should call out that loop detection is not optional when
forking. The Expected Results description is almost tautological - could it
instead say how having this measurement is _useful_ to those consuming this
benchmark?

*M 6.8, Procedure: Why is "May need to run for each transport of interest." in
a section titled "Session Establishement Rate with TLS Encrypted SIP"?

*M 6.10: This document doesn't define Flooding. What do you mean? How is this different than "Stress test" as called out in section 4.8? Where does 500 come from? (Again, I suspect that's a very old value - and you should be providing
guidance rather than an absolute number). But it's not clear how this isn't
just the session establishment rate test started with a bigger number.
What is it actually trying to report on that's different from the session
establishment rate test, and how is the result useful?

*M 6.11: Is each registration going to a different AoR? (You must be, or the
re-registration test makes no sense.) You might talk about configuring the
registrar and the EA so they know what to use.

*M 6.12, Expected Results: Where do you get the idea that re-registration
should be faster than initial registration? How is knowing the difference (or
even that there is a difference) between this and the registration metric
likely to be useful to the consumer?

*M 6.14: Session Capacity, as defined in the terminology doc, is a count of
sessions, not a rate. This section treats it as a rate and says it can be
interpreted as "throughput". I'm struggling to see what it actually is
measuring. The way your algorithm is defined in section 4.9, I find s before I
use T. Let's say I've got a box where the value of s that's found is 10000, and
I've got enough memory that I can deal with several large values of T. If I run
this test with a T of 50000, my benchmark result is 500,000,000. If I run with a
T of 100000, my benchmark result is 1,000,000,000. How are those numbers telling
me _anything_ about session capacity? That the _real_ session capacity is at
least that much? Is there some part of this methodology that has me hunt for a
maximal value of T? Unless I've missed something, this metric needs more
clarification to not be completely misleading. Maybe instead of "Session
Capacity" you should simply be reporting "Simultaneous Sessions Measured".

*M 8: "and various other drafts" is not helpful - if you know of other
important documents to point to, point to them.

Nits:

T: The definitions of Stateful Proxy and Stateless Proxy copied the words
"defined by this specification" from RFC3261. This literal copy introduces
confusion. Can you make it more visually obvious you are quoting? And even if
you do, could you replace "by this specification" with "by [RFC3261]"?

T, Introduction, 2nd paragraph, last sentence: This rules out stateless
proxies.

T, Section 3: In the places where this template is used, you are careful to say None under Issues when there aren't any, but not so careful to say None under See Also when there isn't anything. Leaving them blank makes some transitions
hard to read - they read like you are saying see also (whatever the next
section heading is).

T, 3.1.6, Discussion: s/tie interval/time interval/

M, Introduction, paragraph 2: You say "any [RFC3261] conforming device", but
you've ruled endpoint UAs out in other parts of the documents.

M 4.9: You have comments explaining send_traffic the _second_ time you use it.
They would be better positioned at the first use.

M 5.2: This is the first place the concept of re-Registration is mentioned. A
forward pointer to what you mean, or an introduction before you get to this
format would be clearer.


On 1/16/13 3:48 PM, The IESG wrote:
The IESG has received a request from the Benchmarking Methodology WG
(bmwg) to consider the following document:
- 'Terminology for Benchmarking Session Initiation Protocol (SIP)
    Networking Devices'
   <draft-ietf-bmwg-sip-bench-term-08.txt> as Informational RFC

The IESG plans to make a decision in the next few weeks, and solicits
final comments on this action. Please send substantive comments to the
ietf@ietf.org mailing lists by 2013-01-30. Exceptionally, comments may be
sent to i...@ietf.org instead. In either case, please retain the
beginning of the Subject line to allow automated sorting.

Abstract


    This document provides a terminology for benchmarking the SIP
    performance of networking devices.  The term performance in this
    context means the capacity of the device- or system-under-test to
    process SIP messages.  Terms are included for test components, test
    setup parameters, and performance benchmark metrics for black-box
    benchmarking of SIP networking devices.  The performance benchmark
    metrics are obtained for the SIP signaling plane only.  The terms are
    intended for use in a companion methodology document for
    characterizing the performance of a SIP networking device under a
    variety of conditions.  The intent of the two documents is to enable
    a comparison of the capacity of SIP networking devices.  Test setup
    parameters and a methodology document are necessary because SIP
    allows a wide range of configuration and operational conditions that
    can influence performance benchmark measurements.  A standard
    terminology and methodology will ensure that benchmarks have
    consistent definition and were obtained following the same
    procedures.




The file can be obtained via
http://datatracker.ietf.org/doc/draft-ietf-bmwg-sip-bench-term/

IESG discussion can be tracked via
http://datatracker.ietf.org/doc/draft-ietf-bmwg-sip-bench-term/ballot/


No IPR declarations have been submitted directly on this I-D.


