Hi Ravi,

Yes I was one of the people who never followed up.  I'll be at the vendor 
summit next week, let's chat some more.

Scott

Sent from my iPhone

> On Oct 26, 2015, at 8:22 PM, Pokala, Ravi <rpok...@panasas.com> wrote:
> 
> Hi folks,
> 
> ----------------------------------------------------------------
> This is an updated re-send of a message I originally sent about a year ago, 
> during MeetBSD 2014. A few people expressed interest in person, but no one 
> ever followed up on the lists. I'm bringing this up again, in the hopes that 
> I can get some feedback before / during next week's Dev/Vendor Summit. I'm 
> CCing some folks who expressed interest at that time, and some folks that I 
> was told would be interested.
> ----------------------------------------------------------------
> 
> At Panasas, we found it useful to have a trace buffer for each ATA drive in 
> the system, for gathering information about command sequences and latency. I 
> implemented that functionality for our old 7-ish ATA driver, and it works 
> quite well, with fairly low overhead. (I ran our usual suite of performance 
> tests, and any differences were lost in the noise.) These traces allowed us 
> to capture and replay access patterns (though obviously not with the same 
> data), analyze and compare latency, etc.
> 
> As Panasas moves toward a more modern base, we want to re-implement this 
> functionality for ATA, SCSI, NVMe, etc. And push it upstream, of course! :-)
> 
> First, some example output; I wrote code for three different formats:
>  - The binary trace buffer dump
>    Useful for manipulating in C; the other two formats are post-processed out 
> of this, and I also wrote a replay utility and a basic stats analyzer which 
> operated on the binary dump.
>  - Tabular hex
>    Useful for parsing with scripts; I never actually used this format, but it 
> was trivial so I just went ahead and wrote it.
>  - Human-readable
>    Useful for eyeballing; on more than one occasion, I was able to see that I 
> had mis-assembled a command in userspace, etc.
> 
> These examples are actually really long lines; I've broken them up and 
> indented for the sake of email:
> 
>    # Tabular hex format, for use w/ scripts
>    # Request time       ATA request information   Response time        
>        ATA response information  sts/err  flags
>    1445839793003737     c8 0000 0004 000000002fa8 1445839793003806     
>        c8 0000 0004 000000002fa8 50 00 0000007d
>    1445839793003874     c8 0000 0020 000000113ac8 1445839793011541     
>        c8 0000 0020 000000113ac8 50 00 0000007d
>    1445839793011795     c8 0000 0008 000000118ca8 1445839793017702     
>        c8 0000 0008 000000118ca8 50 00 0000007d
>    1445839793017808     c8 0000 0080 000000117628 1445839793018975     
>        c8 0000 0080 000000117628 50 00 0000007d
>    1445839793019079     c8 0000 0020 0000001177a8 1445839793019686     
>        c8 0000 0020 0000001177a8 50 00 0000007d
> 
>    # Human-readable
>    FLAGS       REQUEST TIME         RESPONSE TIME        ELAPSED TIME         
>        REQUEST COMMAND                          STATUS   ERROR    
>        RESPONSE COMMAND                        
>    ----------- -------------------- -------------------- -------------------- 
>        ---------------------------------------- -------- -------- 
>        ----------------------------------------
>    ____C...._V 1445839793003737     1445839793003806     69                   
>        READ DMA (LBA 12200 + 4 sectors)         _R_S____ ________ 
>        ----                                    
>    ____C...._V 1445839793003874     1445839793011541     7667                 
>        READ DMA (LBA 1129160 + 32 sectors)      _R_S____ ________ 
>        ----                                    
>    ____C...._V 1445839793011795     1445839793017702     5907                 
>        READ DMA (LBA 1150120 + 8 sectors)       _R_S____ ________ 
>        ----                                    
>    ____C...._V 1445839793017808     1445839793018975     1167                 
>        READ DMA (LBA 1144360 + 128 sectors)     _R_S____ ________ 
>        ----                                    
>    ____C...._V 1445839793019079     1445839793019686     607                  
>        READ DMA (LBA 1144744 + 32 sectors)      _R_S____ ________ 
>        ----                                    
> 
> The obvious place to put this sort of tracing would be in GEOM, but that 
> actually isn't what I want, for a few reasons. First of all, I'm primarily 
> interested in the exact register / taskfile / CDB values sent to the drive - 
> opcode, feature codes, LBAs, etc. Second, I'm interested in hardware-level 
> latency - how long does a request stay in the controller and device. Both of 
> those pretty much require working below GEOM.
> 
> Therefore, I think I want to do it in the CAM stack; ideally, I'd put the 
> hooks in the SIM drivers, to get as close to the metal as possible, to get 
> the most accurate latency. However, even a light touch to every SIM driver is 
> going to be painful, and since I don't have access to every controller in the 
> world, testing would be difficult. Adding the hooks to da(4) and ada(4) would 
> still be pretty close to the metal, and would require much smaller changes. 
> Or maybe they belong in the XPT layer? (I don't know nearly as much about CAM 
> as I should. :-/)
> 
> --------
> NOTE: When I mentioned this idea at a (San Francisco) Bay Area FreeBSD User 
> Group (BAFUG) meeting, there were suggestions about integrating this with 
> DTrace or KTrace instead. I'm happy to consider either of those, but I know 
> even less about those than I do about CAM. If that's really the best way to 
> do this sort of thing, I'm happy to learn, if someone is willing to teach.
> --------
> 
> I took the Panasas / 7.x implementation, and pseudo-coded what it might look 
> like in CAM. (I could probably make diffs against the old ATA stack available 
> for comparison, but I figured no one would care anymore.) A huge caveat is 
> that I've never really poked in CAM before, so I fully expect I'm doing 
> several things wrong. Please educate me. :-)
> 
> ================================
> 
> Create a circular buffer of trace records for each CAM PERIPH. Each record 
> will contain the full CDB for the request as sent to the device, the 
> timestamp (in usec) of the request, the CDB as returned by the device, the 
> timestamp of the response, and the status and error code. Additionally, flags 
> are provided to track the state of the trace record, and to indicate which 
> type of periph the record is associated with.
> 
> --------
> 
> typedef struct {
>    u_int32_t    flags;
> #define CAM_TRACE_F_VALID    0x00000001    /* valid record */
> #define CAM_TRACE_F_INPROGRESS    0x00000002    /* request in progress */
> #define CAM_TRACE_F_REQ_VALID    0x00000004    /* request data valid */
> #define CAM_TRACE_F_RESP_VALID    0x00000008    /* response data valid */
> #define CAM_TRACE_F_COMPLETE    0x00000010    /* response gathered */
> #define CAM_TRACE_F_TIMEDOUT    0x00000020    /* request timed out */
> #define CAM_TRACE_F_ABANDONED    0x00000040    /* cancelled, etc. */
> #define CAM_TRACE_F_RETRIED    0x00000080    /* retry in later record */
> #define CAM_TRACE_F_IS_RETRY    0x00000100    /* retry of earlier req. */
> #define CAM_TRACE_F_PERIPH_MASK    0xf0000000    /* up to 16 PERIPH types */
> #define CAM_TRACE_F_PERIPH_DA    0x0
> #define CAM_TRACE_F_PERIPH_ADA    0x1
> #define CAM_TRACE_F_PERIPH_NVD    0x2
>    u_int64_t    req_usec;
>    cdb_t        req_cdb;
>    u_int64_t    resp_usec;
>    cdb_t        resp_cdb;
>    u_int8_t    resp_status;
>    u_int8_t    resp_error;
>    u_int8_t    resp_asc;    /* Needed for SCSI? */
>    u_int8_t    resp_asq;    /* Needed for SCSI? */
>    u_int8_t    padding[8];    /* Pad to 64 bytes */
> /* There's going to be a large-ish array of these in the kernel; make sure
> * they're packed to minimize space usage, and keep things aligned. It may
> * make sense to re-order the fields for alignment as well.
> */
> } cam_trace_record_t __attribute__((packed, aligned(8)));
> 
> typedef struct {
>    u_int32_t        trace_buf_record_count;
>    cam_trace_record_t    *trace_buf;
> } cam_trace_buffer_t;
> 
> /* This is settable via a tunable; for Panasas' purposes, 100K entries was
> * enough to provide a good amount of history even on a busy system, while
> * not taking up much space (< 5MB/drive).
> */
> #define CAM_TRACE_BUFFER_DEFAULT_COUNT 100000
> 
> ----------------
> 
> ioctl interfaces are used to get the trace buffer size, to retrieve the 
> buffer contents, or to clear the buffer. The buffer can then be analyzed from 
> userspace, or written to a file for post-processing.
> 
> --------
> 
> #define CAM_GET_TRACE_RECORD_COUNT    _IOR(???, ???, u_int32_t)
> #define CAM_GET_TRACE_BUFFER        _IOWR(???, ???, cam_trace_buffer_t)
> #define CAM_CLEAR_TRACE_BUFFER        _IOW(???, ???, u_int32_t)
> 
> ----------------
> 
> To get the buffer:
> 
> --------
> 
>    cam_trace_buffer_t trace = NULL;
>    fd = open(periph_name, O_RDONLY);
>    error = ioctl(fd, CAM_GET_TRACE_RECORD_COUNT,
>        &trace.trace_buf_record_count);
>    trace.trace_buf = malloc(trace.trace_buf_record_count *
>        sizeof(cam_trace_record_t));
>    error = ioctl(fd, CAM_GET_TRACE_BUFFER, &trace);
> 
> ----------------
> 
> By including the full CDB, the entire command can be decoded by analysis 
> tools. Tools can be written to parse the buffer and translate command data 
> into human-readable descriptions, calculate elapsed times for requests, 
> perform statistical analysis based on opcode, command type, transfer size, 
> etc. Command decoders would need to be written for each PERIPH type (i.e. one 
> for ATA, one for SCSI, one for NVMe, etc.).
> 
> ----------------
> 
> Each PERIPH's softc needs to include some information about the per-device
> buffer.
> 
> --------
> 
>    u_int32_t        trace_buf_record_count;
>    u_int64_t        trace_buf_request_count;
>    cam_trace_record_t    *trace_buf;
> 
> ----------------
> 
> In each SIM, there should already be some sort of request structure, which 
> associates the softc and the CDB of the request. To that structure, add a 
> pointer to the trace record associated with the request.
> 
> --------
> 
> request structure:
>    cam_trace_record_t    *trace_rec;
> 
> ----------------
> 
> When initializing the per-device softc, allocate a trace buffer. The number 
> of records can be hard-coded, or settable via a tunable.
> 
> The example below has a global size for everything, but it should be possible 
> to do it on a per-device or per-PERIPH basis.
> 
> --------
> 
>    /** Initialize per-drive command tracing buffer. **/
>    /* Get the buffer size from the tunable. Sanity check it. */
>    sc->trace_buf_record_count = CAM_TRACE_BUFFER_DEFAULT_COUNT;
>    TUNABLE_ULONG_FETCH("cam.trace_buf_record_count",
>        &trace_buf_record_count_tunable);
>    if (trace_buf_record_count_tunable) {
>        /* Less than 1K records is unusably few; more than 1M
>         * records is unusably many.
>         */
>        if ((trace_buf_record_count_tunable < (1 << 10)) ||
>            (trace_buf_record_count_tunable > (1 << 20))) {
>            device_printf(dev, "cam.trace_buf_record_count=%lu"
>                " out of range; should be between %u and %u\n",
>                trace_buf_record_count_tunable, (1 << 10),
>                (1 << 20));
>        } else {
>            sc->trace_buf_record_count = (u_int32_t)
>                trace_buf_record_count_tunable;
>        }
>    }
>    /* Allocate the buffer. */
>    trace_buf_bytes = sc->trace_buf_record_count *
>        sizeof(cam_trace_record_t);
>    sc->trace_buf = malloc(trace_buf_bytes, M_AD, M_NOWAIT | M_ZERO);
>    if (sc->trace_buf == NULL) {
>        device_printf(dev, "unable to allocate %zd for trace_buf\n",
>            trace_buf_bytes);
>        sc->trace_buf_record_count = 0;
>    } else {
>        device_printf(dev, "allocated %zd @ %p for trace_buf\n",
>            trace_buf_bytes, sc->trace_buf);
>    }
>    /* Request counter, used for indexing */
>    sc->trace_buf_request_count = 0;
> 
> 
> ----------------
> 
> /* Initial setup of the trace record for a new request.
> *
> *  Locate the circular per-device trace buffer in the drive's softc.
> *  Populate the pointer to the next record in the trace buffer; this is
> *  done atomically to prevent concurrent requests from getting the same
> *  index. That could get a bit weird if/when 'trace_buf_request_count'
> *  rolls over, if the maximum counter value is not a multiple of
> *  'trace_buf_record_count' (i.e. the request counter recycles in the
> *  middle of the buffer). By using a 64-bit request counter, we're pretty
> *  safe - even at one transaction per nanosecond, that will still last
> *  about 585 years.
> *
> *  Initialize the record by zeroing it and then setting the VALID flag.
> *
> *  This could be done at any point before sending the request to the
> *  hardware; neither the CDB data nor the timestamps are involved yet.
> */
> void
> trace_rec_init(cam_request_t req) {
>    ada_softc *sc = (request->...->softc)
>    u_int64_t tmp = 0;
>    if (sc && sc->trace_buf) {
>        tmp = atomic_fetchadd_log(&sc->trace_buf_request_count, 1);
>        req->trace_rec = &sc->trace_buf[tmp %
>            sc->trace_buf_record_count];
>        bzero(req->trace_rec, sizeof(*req->trace_rec));
>        req->trace_rec->flags = (CAM_TRACE_F_PERIPH_XXX << 28);
>        req->trace_rec->flags |= CAM_TRACE_F_VALID;
>    } else {
>        req->trace_rec = NULL;
>    }
> }
> 
> ----------------
> 
> /* Stuff the request data into the trace record.
> *
> *  If the trace record is present and valid, copy the CDB into the
> *  record. Set CAM_TRACE_F_REQ_VALID to indicate that the request data is
> *  valid, and INPROGRESS to show that the request is in-progress.
> *
> *  This should be done just before sending the request to the hardware.
> *  This should *not* be fused into trace_rec_init(), because
> *  trace_rec_init() is also used in the retry case, where it is not
> * correct to immediately start the timer.
> */
> void
> trace_rec_req(cam_request_t req) {
>    struct timeval tv;
>    if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
>        bcopy(req->cdb, req->trace_rec->req_cdb, sizeof(req->cdb));
>        microuptime(&tv);
>        req->trace_rec->req_usec = ((uint64_t) tv->tv_sec * 1000000)
>            + tv->tv_usec;
>        req->trace_rec->flags |=
>            (CAM_TRACE_F_REQ_VALID | CAM_TRACE_F_INPROGRESS);
>    }
> }
> 
> ----------------
> 
> /* Stuff the response data into the trace record.
> *
> *  If the trace record is present and valid, copy the CDB **as returned by
> *  the drive** into the record. At this point, we also have the result from
> *  the command (status, error, (asc? ascq?)), so save them too. Set
> *  CAM_TRACE_F_RESP_VALID to indicate that the response data is valid.
> *  Clear INPROGRESS and set COMPLETE instead.
> *
> *  This should be done just after getting the response information from the
> *  hardware.
> */
> void
> trace_rec_resp(cam_request_t req) {
>    struct timeval tv;
>    if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
>        microuptime(&tv);
>        bcopy(req->cdb, req->trace_rec->resp_cdb, sizeof(req->cdb));
>        req->trace_rec->resp_usec = ((uint64_t) tv->tv_sec *
>            1000000) + tv->tv_usec;
>        req->trace_rec->flags |=
>            (CAM_TRACE_F_RESP_VALID | CAM_TRACE_F_COMPLETE);
>        req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
>    }
> }
> 
> ----------------
> 
> /* Update the trace record if there was a timeout.
> *
> *  If the trace record is present and valid, set TIMEDOUT to indicate
> *  that the request associated with the record exceeded its timeout.
> *  Since this attempt is complete, clear INPROGRESS and set COMPLETE.
> *
> *  This should be done along with other timeout handling.
> */
> void
> trace_rec_timeout(cam_request_t req) {
>    if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
>        req->trace_rec->flags |=
>            (CAM_TRACE_F_TIMEDOUT | CAM_TRACE_F_COMPLETE);
>        req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
>    }
> }
> 
> ----------------
> 
> /* Update the trace record if the request is being retried.
> *  
> *  If the trace record is present and valid, set RETRIED to indicate
> *  that the request associated with the record is going to be retried.
> *  Since this attempt is complete, clear INPROGRESS and set COMPLETE.
> *  Then get a new trace record for the new attempt, copy the request
> *  command data to the new record, and set IS_RETRY to show that this
> *  is a retry of an earlier request.
> *
> *  This should be done at the time the retry is queued.
> */
> void
> trace_rec_retry(cam_request_t req) {
>    if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) {
>        /* First, mark the original request as being requeued. */
>        req->trace_rec->flags |=
>            (CAM_TRACE_F_RETRIED | CAM_TRACE_F_COMPLETE);
>        req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS;
>        /* Get a new record for the retry. */
>        trace_rec_init(req);
>        /* Mark the new record as being a retry. */
>        req->trace_rec->flags |= CAM_TRACE_F_IS_RETRY;
>    }
> }
> 
> ================================
> 
> I'd appreciate feedback, both on the implementation details and on the idea 
> as a whole. Some open questions are:
> 
> - My assumption is that each SIM has a structure that associates a PERIPH's 
> softc with the CDB that's being processed. It might require dereferencing a 
> chain of pointers, but it should be there in some fashion.
> 
> - The current implementation adds a few ioctls to the old ad(4) disk-device 
> driver. I would expect to add the new ioctls to da(4) and ada(4); is that the 
> right thing to do?
> 
> Again, any feedback would be greatly appreciated. I'll be at the Dev/Vendor 
> Summit next week, if anyone wants to discuss this with me in person.
> 
> Thanks,
> 
> Ravi
> _______________________________________________
> freebsd-s...@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscr...@freebsd.org"
_______________________________________________
freebsd-geom@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-geom
To unsubscribe, send any mail to "freebsd-geom-unsubscr...@freebsd.org"

Reply via email to