Hi folks,

----------------------------------------------------------------
This is an updated re-send of a message I originally sent about a year ago, 
during MeetBSD 2014. A few people expressed interest in person, but no one ever 
followed up on the lists. I'm bringing this up again, in the hopes that I can 
get some feedback before / during next week's Dev/Vendor Summit. I'm CCing some 
folks who expressed interest at that time, and some folks that I was told would 
be interested.
----------------------------------------------------------------

At Panasas, we found it useful to have a trace buffer for each ATA drive in the 
system, for gathering information about command sequences and latency. I 
implemented that functionality for our old 7-ish ATA driver, and it works quite 
well, with fairly low overhead. (I ran our usual suite of performance tests, 
and any differences were lost in the noise.) These traces allowed us to capture 
and replay access patterns (though obviously not with the same data), analyze 
and compare latency, etc.

As Panasas moves toward a more modern base, we want to re-implement this 
functionality for ATA, SCSI, NVMe, etc. And push it upstream, of course! :-)

First, some example output; I wrote code for three different formats:
  - The binary trace buffer dump
    Useful for manipulating in C; the other two formats are post-processed out 
of this, and I also wrote a replay utility and a basic stats analyzer which 
operated on the binary dump.
  - Tabular hex
    Useful for parsing with scripts; I never actually used this format, but it 
was trivial so I just went ahead and wrote it.
  - Human-readable
    Useful for eyeballing; on more than one occasion, I was able to see that I 
had mis-assembled a command in userspace, etc.

These examples are actually really long lines; I've broken them up and indented 
for the sake of email:

    # Tabular hex format, for use w/ scripts
    # Request time       ATA request information   Response time        
        ATA response information  sts/err  flags
    1445839793003737     c8 0000 0004 000000002fa8 1445839793003806     
        c8 0000 0004 000000002fa8 50 00 0000007d
    1445839793003874     c8 0000 0020 000000113ac8 1445839793011541     
        c8 0000 0020 000000113ac8 50 00 0000007d
    1445839793011795     c8 0000 0008 000000118ca8 1445839793017702     
        c8 0000 0008 000000118ca8 50 00 0000007d
    1445839793017808     c8 0000 0080 000000117628 1445839793018975     
        c8 0000 0080 000000117628 50 00 0000007d
    1445839793019079     c8 0000 0020 0000001177a8 1445839793019686     
        c8 0000 0020 0000001177a8 50 00 0000007d

    # Human-readable
    FLAGS       REQUEST TIME         RESPONSE TIME        ELAPSED TIME         
        REQUEST COMMAND                          STATUS   ERROR    
        RESPONSE COMMAND                        
    ----------- -------------------- -------------------- -------------------- 
        ---------------------------------------- -------- -------- 
        ----------------------------------------
    ____C...._V 1445839793003737     1445839793003806     69                   
        READ DMA (LBA 12200 + 4 sectors)         _R_S____ ________ 
        ----                                    
    ____C...._V 1445839793003874     1445839793011541     7667                 
        READ DMA (LBA 1129160 + 32 sectors)      _R_S____ ________ 
        ----                                    
    ____C...._V 1445839793011795     1445839793017702     5907                 
        READ DMA (LBA 1150120 + 8 sectors)       _R_S____ ________ 
        ----                                    
    ____C...._V 1445839793017808     1445839793018975     1167                 
        READ DMA (LBA 1144360 + 128 sectors)     _R_S____ ________ 
        ----                                    
    ____C...._V 1445839793019079     1445839793019686     607                  
        READ DMA (LBA 1144744 + 32 sectors)      _R_S____ ________ 
        ----                                    

The obvious place to put this sort of tracing would be in GEOM, but that 
actually isn't what I want, for a few reasons. First of all, I'm primarily 
interested in the exact register / taskfile / CDB values sent to the drive - 
opcode, feature codes, LBAs, etc. Second, I'm interested in hardware-level 
latency - how long does a request stay in the controller and device. Both of 
those pretty much require working below GEOM.

Therefore, I think I want to do it in the CAM stack; ideally, I'd put the hooks 
in the SIM drivers, to get as close to the metal as possible, to get the most 
accurate latency. However, even a light touch to every SIM driver is going to 
be painful, and since I don't have access to every controller in the world, 
testing would be difficult. Adding the hooks to da(4) and ada(4) would still be 
pretty close to the metal, and would require much smaller changes. Or maybe 
they belong in the XPT layer? (I don't know nearly as much about CAM as I 
should. :-/)

--------
NOTE: When I mentioned this idea at a (San Francisco) Bay Area FreeBSD User 
Group (BAFUG) meeting, there were suggestions about integrating this with 
DTrace or KTrace instead. I'm happy to consider either of those, but I know 
even less about those than I do about CAM. If that's really the best way to do 
this sort of thing, I'm happy to learn, if someone is willing to teach.
--------

I took the Panasas / 7.x implementation, and pseudo-coded what it might look 
like in CAM. (I could probably make diffs against the old ATA stack available 
for comparison, but I figured no one would care anymore.) A huge caveat is that 
I've never really poked in CAM before, so I fully expect I'm doing several 
things wrong. Please educate me. :-)

================================

Create a circular buffer of trace records for each CAM PERIPH. Each record will 
contain the full CDB for the request as sent to the device, the timestamp (in 
usec) of the request, the CDB as returned by the device, the timestamp of the 
response, and the status and error code. Additionally, flags are provided to 
track the state of the trace record, and to indicate which type of periph the 
record is associated with.

--------

typedef struct {
        u_int32_t       flags;
#define CAM_TRACE_F_VALID       0x00000001      /* valid record */
#define CAM_TRACE_F_INPROGRESS  0x00000002      /* request in progress */
#define CAM_TRACE_F_REQ_VALID   0x00000004      /* request data valid */
#define CAM_TRACE_F_RESP_VALID  0x00000008      /* response data valid */
#define CAM_TRACE_F_COMPLETE    0x00000010      /* response gathered */
#define CAM_TRACE_F_TIMEDOUT    0x00000020      /* request timed out */
#define CAM_TRACE_F_ABANDONED   0x00000040      /* cancelled, etc. */
#define CAM_TRACE_F_RETRIED     0x00000080      /* retry in later record */
#define CAM_TRACE_F_IS_RETRY    0x00000100      /* retry of earlier req. */
#define CAM_TRACE_F_PERIPH_MASK 0xf0000000      /* up to 16 PERIPH types */
#define CAM_TRACE_F_PERIPH_DA   0x0
#define CAM_TRACE_F_PERIPH_ADA  0x1
#define CAM_TRACE_F_PERIPH_NVD  0x2
        u_int64_t       req_usec;
        cdb_t           req_cdb;
        u_int64_t       resp_usec;
        cdb_t           resp_cdb;
        u_int8_t        resp_status;
        u_int8_t        resp_error;
        u_int8_t        resp_asc;       /* Needed for SCSI? */
        u_int8_t        resp_asq;       /* Needed for SCSI? */
        u_int8_t        padding[8];     /* Pad to 64 bytes */
/* There's going to be a large-ish array of these in the kernel; make sure
 * they're packed to minimize space usage, and keep things aligned. It may
 * make sense to re-order the fields for alignment as well.
 */
} cam_trace_record_t __attribute__((packed, aligned(8)));

typedef struct {
        u_int32_t               trace_buf_record_count;
        cam_trace_record_t      *trace_buf;
} cam_trace_buffer_t;

/* This is settable via a tunable; for Panasas' purposes, 100K entries was
 * enough to provide a good amount of history even on a busy system, while
 * not taking up much space (< 5MB/drive).
 */
#define CAM_TRACE_BUFFER_DEFAULT_COUNT 100000

----------------

ioctl interfaces are used to get the trace buffer size, to retrieve the buffer 
contents, or to clear the buffer. The buffer can then be analyzed from 
userspace, or written to a file for post-processing.

--------

#define CAM_GET_TRACE_RECORD_COUNT      _IOR(???, ???, u_int32_t)
#define CAM_GET_TRACE_BUFFER            _IOWR(???, ???, cam_trace_buffer_t)
#define CAM_CLEAR_TRACE_BUFFER          _IOW(???, ???, u_int32_t)

----------------

To get the buffer:

--------

        cam_trace_buffer_t trace = NULL;
        fd = open(periph_name, O_RDONLY);
        error = ioctl(fd, CAM_GET_TRACE_RECORD_COUNT,
            &trace.trace_buf_record_count);
        trace.trace_buf = malloc(trace.trace_buf_record_count *
            sizeof(cam_trace_record_t));
        error = ioctl(fd, CAM_GET_TRACE_BUFFER, &trace);

----------------

By including the full CDB, the entire command can be decoded by analysis tools. 
Tools can be written to parse the buffer and translate command data into 
human-readable descriptions, calculate elapsed times for requests, perform 
statistical analysis based on opcode, command type, transfer size, etc. Command 
decoders would need to be written for each PERIPH type (i.e. one for ATA, one 
for SCSI, one for NVMe, etc.).

----------------

Each PERIPH's softc needs to include some information about the per-device
buffer.

--------

        u_int32_t               trace_buf_record_count;
        u_int64_t               trace_buf_request_count;
        cam_trace_record_t      *trace_buf;

----------------

In each SIM, there should already be some sort of request structure, which 
associates the softc and the CDB of the request. To that structure, add a 
pointer to the trace record associated with the request.

--------

request structure:
        cam_trace_record_t      *trace_rec;

----------------

When initializing the per-device softc, allocate a trace buffer. The number of 
records can be hard-coded, or settable via a tunable.

The example below has a global size for everything, but it should be possible 
to do it on a per-device or per-PERIPH basis.

--------

        /** Initialize per-drive command tracing buffer. **/
        /* Get the buffer size from the tunable. Sanity check it. */
        sc->trace_buf_record_count = CAM_TRACE_BUFFER_DEFAULT_COUNT;
        TUNABLE_ULONG_FETCH("cam.trace_buf_record_count",
            &trace_buf_record_count_tunable);
        if (trace_buf_record_count_tunable) {
                /* Less than 1K records is unusably few; more than 1M
                 * records is unusably many.
                 */
                if ((trace_buf_record_count_tunable < (1 << 10)) ||
                    (trace_buf_record_count_tunable > (1 << 20))) {
                        device_printf(dev, "cam.trace_buf_record_count=%lu"
                            " out of range; should be between %u and %u\n",
                            trace_buf_record_count_tunable, (1 << 10),
                            (1 << 20));
                } else {
                        sc->trace_buf_record_count = (u_int32_t)
                            trace_buf_record_count_tunable;
                }
        }
        /* Allocate the buffer. */
        trace_buf_bytes = sc->trace_buf_record_count *
            sizeof(cam_trace_record_t);
        sc->trace_buf = malloc(trace_buf_bytes, M_AD, M_NOWAIT | M_ZERO);
        if (sc->trace_buf == NULL) {
                device_printf(dev, "unable to allocate %zd for trace_buf\n",
                    trace_buf_bytes);
                sc->trace_buf_record_count = 0;
        } else {
                device_printf(dev, "allocated %zd @ %p for trace_buf\n",
                    trace_buf_bytes, sc->trace_buf);
        }
        /* Request counter, used for indexing */
        sc->trace_buf_request_count = 0;


----------------

/* Initial setup of the trace record for a new request.
 *
 *  Locate the circular per-device trace buffer in the drive's softc.
 *  Populate the pointer to the next record in the trace buffer; this is
 *  done atomically to prevent concurrent requests from getting the same
 *  index. That could get a bit weird if/when 'trace_buf_request_count'
 *  rolls over, if the maximum counter value is not a multiple of
 *  'trace_buf_record_count' (i.e. the request counter recycles in the
 *  middle of the buffer). By using a 64-bit request counter, we're pretty
 *  safe - even at one transaction per nanosecond, that will still last
 *  about 585 years.
 *
 *  Initialize the record by zeroing it and then setting the VALID flag.
 *
 *  This could be done at any point before sending the request to the
 *  hardware; neither the CDB data nor the timestamps are involved yet.
 */
void
trace_rec_init(cam_request_t req) {
        ada_softc *sc = (request->...->softc)
        u_int64_t tmp = 0;
        if (sc && sc->trace_buf) {
                tmp = atomic_fetchadd_log(&sc->trace_buf_request_count, 1);
                req->trace_rec = &sc->trace_buf[tmp %
                    sc->trace_buf_record_count];
                bzero(req->trace_rec, sizeof(*req->trace_rec));
                req->trace_rec->flags = (CAM_TRACE_F_PERIPH_XXX << 28);
                req->trace_rec->flags |= CAM_TRACE_F_VALID;
        } else {
                req->trace_rec = NULL;
        }
}

----------------

/* Stuff the request data into the trace record.
 *
 *  If the trace record is present and valid, copy the CDB into the
 *  record. Set CAM_TRACE_F_REQ_VALID to indicate that the request data is
 *  valid, and INPROGRESS to show that the request is in-progress.
 *
 *  This should be done just before sending the request to the hardware.
 *  This should *not* be fused into trace_rec_init(), because
 *  trace_rec_init() is also used in the retry case, where it is not
 * correct to immediately start the timer.
 */
void
trace_rec_req(cam_request_t req) {
        struct timeval tv;
        if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
                bcopy(req->cdb, req->trace_rec->req_cdb, sizeof(req->cdb));
                microuptime(&tv);
                req->trace_rec->req_usec = ((uint64_t) tv->tv_sec * 1000000)
                    + tv->tv_usec;
                req->trace_rec->flags |=
                    (CAM_TRACE_F_REQ_VALID | CAM_TRACE_F_INPROGRESS);
        }
}

----------------

/* Stuff the response data into the trace record.
 *
 *  If the trace record is present and valid, copy the CDB **as returned by
 *  the drive** into the record. At this point, we also have the result from
 *  the command (status, error, (asc? ascq?)), so save them too. Set
 *  CAM_TRACE_F_RESP_VALID to indicate that the response data is valid.
 *  Clear INPROGRESS and set COMPLETE instead.
 *
 *  This should be done just after getting the response information from the
 *  hardware.
 */
void
trace_rec_resp(cam_request_t req) {
        struct timeval tv;
        if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
                microuptime(&tv);
                bcopy(req->cdb, req->trace_rec->resp_cdb, sizeof(req->cdb));
                req->trace_rec->resp_usec = ((uint64_t) tv->tv_sec *
                    1000000) + tv->tv_usec;
                req->trace_rec->flags |=
                    (CAM_TRACE_F_RESP_VALID | CAM_TRACE_F_COMPLETE);
                req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
        }
}

----------------

/* Update the trace record if there was a timeout.
 *
 *  If the trace record is present and valid, set TIMEDOUT to indicate
 *  that the request associated with the record exceeded its timeout.
 *  Since this attempt is complete, clear INPROGRESS and set COMPLETE.
 *
 *  This should be done along with other timeout handling.
 */
void
trace_rec_timeout(cam_request_t req) {
        if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
                req->trace_rec->flags |=
                    (CAM_TRACE_F_TIMEDOUT | CAM_TRACE_F_COMPLETE);
                req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
        }
}

----------------

/* Update the trace record if the request is being retried.
 *  
 *  If the trace record is present and valid, set RETRIED to indicate
 *  that the request associated with the record is going to be retried.
 *  Since this attempt is complete, clear INPROGRESS and set COMPLETE.
 *  Then get a new trace record for the new attempt, copy the request
 *  command data to the new record, and set IS_RETRY to show that this
 *  is a retry of an earlier request.
 *
 *  This should be done at the time the retry is queued.
 */
void
trace_rec_retry(cam_request_t req) {
        if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
                /* First, mark the original request as being requeued. */
                req->trace_rec->flags |=
                    (CAM_TRACE_F_RETRIED | CAM_TRACE_F_COMPLETE);
                req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
                /* Get a new record for the retry. */
                trace_rec_init(req);
                /* Mark the new record as being a retry. */
                req->trace_rec->flags |= CAM_TRACE_F_IS_RETRY;
        }
}

================================

I'd appreciate feedback, both on the implementation details and on the idea as 
a whole. Some open questions are:

- My assumption is that each SIM has a structure that associates a PERIPH's 
softc with the CDB that's being processed. It might require dereferencing a 
chain of pointers, but it should be there in some fashion.

- The current implementation adds a few ioctls to the old ad(4) disk-device 
driver. I would expect to add the new ioctls to da(4) and ada(4); is that the 
right thing to do?

Again, any feedback would be greatly appreciated. I'll be at the Dev/Vendor 
Summit next week, if anyone wants to discuss this with me in person.

Thanks,

Ravi
_______________________________________________
freebsd-geom@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-geom
To unsubscribe, send any mail to "freebsd-geom-unsubscr...@freebsd.org"

Reply via email to