Hi Ravi, Yes I was one of the people who never followed up. I'll be at the vendor summit next week, let's chat some more.
Scott Sent from my iPhone > On Oct 26, 2015, at 8:22 PM, Pokala, Ravi <rpok...@panasas.com> wrote: > > Hi folks, > > ---------------------------------------------------------------- > This is an updated re-send of a message I originally sent about a year ago, > during MeetBSD 2014. A few people expressed interest in person, but no one > ever followed up on the lists. I'm bringing this up again, in the hopes that > I can get some feedback before / during next week's Dev/Vendor Summit. I'm > CCing some folks who expressed interest at that time, and some folks that I > was told would be interested. > ---------------------------------------------------------------- > > At Panasas, we found it useful to have a trace buffer for each ATA drive in > the system, for gathering information about command sequences and latency. I > implemented that functionality for our old 7-ish ATA driver, and it works > quite well, with fairly low overhead. (I ran our usual suite of performance > tests, and any differences were lost in the noise.) These traces allowed us > to capture and replay access patterns (though obviously not with the same > data), analyze and compare latency, etc. > > As Panasas moves toward a more modern base, we want to re-implement this > functionality for ATA, SCSI, NVMe, etc. And push it upstream, of course! :-) > > First, some example output; I wrote code for three different formats: > - The binary trace buffer dump > Useful for manipulating in C; the other two formats are post-processed out > of this, and I also wrote a replay utility and a basic stats analyzer which > operated on the binary dump. > - Tabular hex > Useful for parsing with scripts; I never actually used this format, but it > was trivial so I just went ahead and wrote it. > - Human-readable > Useful for eyeballing; on more than one occasion, I was able to see that I > had mis-assembled a command in userspace, etc. > > These examples are actually really long lines; I've broken them up and > indented for the sake of email: > > # Tabular hex format, for use w/ scripts > # Request time ATA request information Response time > ATA response information sts/err flags > 1445839793003737 c8 0000 0004 000000002fa8 1445839793003806 > c8 0000 0004 000000002fa8 50 00 0000007d > 1445839793003874 c8 0000 0020 000000113ac8 1445839793011541 > c8 0000 0020 000000113ac8 50 00 0000007d > 1445839793011795 c8 0000 0008 000000118ca8 1445839793017702 > c8 0000 0008 000000118ca8 50 00 0000007d > 1445839793017808 c8 0000 0080 000000117628 1445839793018975 > c8 0000 0080 000000117628 50 00 0000007d > 1445839793019079 c8 0000 0020 0000001177a8 1445839793019686 > c8 0000 0020 0000001177a8 50 00 0000007d > > # Human-readable > FLAGS REQUEST TIME RESPONSE TIME ELAPSED TIME > REQUEST COMMAND STATUS ERROR > RESPONSE COMMAND > ----------- -------------------- -------------------- -------------------- > ---------------------------------------- -------- -------- > ---------------------------------------- > ____C...._V 1445839793003737 1445839793003806 69 > READ DMA (LBA 12200 + 4 sectors) _R_S____ ________ > ---- > ____C...._V 1445839793003874 1445839793011541 7667 > READ DMA (LBA 1129160 + 32 sectors) _R_S____ ________ > ---- > ____C...._V 1445839793011795 1445839793017702 5907 > READ DMA (LBA 1150120 + 8 sectors) _R_S____ ________ > ---- > ____C...._V 1445839793017808 1445839793018975 1167 > READ DMA (LBA 1144360 + 128 sectors) _R_S____ ________ > ---- > ____C...._V 1445839793019079 1445839793019686 607 > READ DMA (LBA 1144744 + 32 sectors) _R_S____ ________ > ---- > > The obvious place to put this sort of tracing would be in GEOM, but that > actually isn't what I want, for a few reasons. First of all, I'm primarily > interested in the exact register / taskfile / CDB values sent to the drive - > opcode, feature codes, LBAs, etc. Second, I'm interested in hardware-level > latency - how long does a request stay in the controller and device. Both of > those pretty much require working below GEOM. > > Therefore, I think I want to do it in the CAM stack; ideally, I'd put the > hooks in the SIM drivers, to get as close to the metal as possible, to get > the most accurate latency. However, even a light touch to every SIM driver is > going to be painful, and since I don't have access to every controller in the > world, testing would be difficult. Adding the hooks to da(4) and ada(4) would > still be pretty close to the metal, and would require much smaller changes. > Or maybe they belong in the XPT layer? (I don't know nearly as much about CAM > as I should. :-/) > > -------- > NOTE: When I mentioned this idea at a (San Francisco) Bay Area FreeBSD User > Group (BAFUG) meeting, there were suggestions about integrating this with > DTrace or KTrace instead. I'm happy to consider either of those, but I know > even less about those than I do about CAM. If that's really the best way to > do this sort of thing, I'm happy to learn, if someone is willing to teach. > -------- > > I took the Panasas / 7.x implementation, and pseudo-coded what it might look > like in CAM. (I could probably make diffs against the old ATA stack available > for comparison, but I figured no one would care anymore.) A huge caveat is > that I've never really poked in CAM before, so I fully expect I'm doing > several things wrong. Please educate me. :-) > > ================================ > > Create a circular buffer of trace records for each CAM PERIPH. Each record > will contain the full CDB for the request as sent to the device, the > timestamp (in usec) of the request, the CDB as returned by the device, the > timestamp of the response, and the status and error code. Additionally, flags > are provided to track the state of the trace record, and to indicate which > type of periph the record is associated with. > > -------- > > typedef struct { > u_int32_t flags; > #define CAM_TRACE_F_VALID 0x00000001 /* valid record */ > #define CAM_TRACE_F_INPROGRESS 0x00000002 /* request in progress */ > #define CAM_TRACE_F_REQ_VALID 0x00000004 /* request data valid */ > #define CAM_TRACE_F_RESP_VALID 0x00000008 /* response data valid */ > #define CAM_TRACE_F_COMPLETE 0x00000010 /* response gathered */ > #define CAM_TRACE_F_TIMEDOUT 0x00000020 /* request timed out */ > #define CAM_TRACE_F_ABANDONED 0x00000040 /* cancelled, etc. */ > #define CAM_TRACE_F_RETRIED 0x00000080 /* retry in later record */ > #define CAM_TRACE_F_IS_RETRY 0x00000100 /* retry of earlier req. */ > #define CAM_TRACE_F_PERIPH_MASK 0xf0000000 /* up to 16 PERIPH types */ > #define CAM_TRACE_F_PERIPH_DA 0x0 > #define CAM_TRACE_F_PERIPH_ADA 0x1 > #define CAM_TRACE_F_PERIPH_NVD 0x2 > u_int64_t req_usec; > cdb_t req_cdb; > u_int64_t resp_usec; > cdb_t resp_cdb; > u_int8_t resp_status; > u_int8_t resp_error; > u_int8_t resp_asc; /* Needed for SCSI? */ > u_int8_t resp_asq; /* Needed for SCSI? */ > u_int8_t padding[8]; /* Pad to 64 bytes */ > /* There's going to be a large-ish array of these in the kernel; make sure > * they're packed to minimize space usage, and keep things aligned. It may > * make sense to re-order the fields for alignment as well. > */ > } cam_trace_record_t __attribute__((packed, aligned(8))); > > typedef struct { > u_int32_t trace_buf_record_count; > cam_trace_record_t *trace_buf; > } cam_trace_buffer_t; > > /* This is settable via a tunable; for Panasas' purposes, 100K entries was > * enough to provide a good amount of history even on a busy system, while > * not taking up much space (< 5MB/drive). > */ > #define CAM_TRACE_BUFFER_DEFAULT_COUNT 100000 > > ---------------- > > ioctl interfaces are used to get the trace buffer size, to retrieve the > buffer contents, or to clear the buffer. The buffer can then be analyzed from > userspace, or written to a file for post-processing. > > -------- > > #define CAM_GET_TRACE_RECORD_COUNT _IOR(???, ???, u_int32_t) > #define CAM_GET_TRACE_BUFFER _IOWR(???, ???, cam_trace_buffer_t) > #define CAM_CLEAR_TRACE_BUFFER _IOW(???, ???, u_int32_t) > > ---------------- > > To get the buffer: > > -------- > > cam_trace_buffer_t trace = NULL; > fd = open(periph_name, O_RDONLY); > error = ioctl(fd, CAM_GET_TRACE_RECORD_COUNT, > &trace.trace_buf_record_count); > trace.trace_buf = malloc(trace.trace_buf_record_count * > sizeof(cam_trace_record_t)); > error = ioctl(fd, CAM_GET_TRACE_BUFFER, &trace); > > ---------------- > > By including the full CDB, the entire command can be decoded by analysis > tools. Tools can be written to parse the buffer and translate command data > into human-readable descriptions, calculate elapsed times for requests, > perform statistical analysis based on opcode, command type, transfer size, > etc. Command decoders would need to be written for each PERIPH type (i.e. one > for ATA, one for SCSI, one for NVMe, etc.). > > ---------------- > > Each PERIPH's softc needs to include some information about the per-device > buffer. > > -------- > > u_int32_t trace_buf_record_count; > u_int64_t trace_buf_request_count; > cam_trace_record_t *trace_buf; > > ---------------- > > In each SIM, there should already be some sort of request structure, which > associates the softc and the CDB of the request. To that structure, add a > pointer to the trace record associated with the request. > > -------- > > request structure: > cam_trace_record_t *trace_rec; > > ---------------- > > When initializing the per-device softc, allocate a trace buffer. The number > of records can be hard-coded, or settable via a tunable. > > The example below has a global size for everything, but it should be possible > to do it on a per-device or per-PERIPH basis. > > -------- > > /** Initialize per-drive command tracing buffer. **/ > /* Get the buffer size from the tunable. Sanity check it. */ > sc->trace_buf_record_count = CAM_TRACE_BUFFER_DEFAULT_COUNT; > TUNABLE_ULONG_FETCH("cam.trace_buf_record_count", > &trace_buf_record_count_tunable); > if (trace_buf_record_count_tunable) { > /* Less than 1K records is unusably few; more than 1M > * records is unusably many. > */ > if ((trace_buf_record_count_tunable < (1 << 10)) || > (trace_buf_record_count_tunable > (1 << 20))) { > device_printf(dev, "cam.trace_buf_record_count=%lu" > " out of range; should be between %u and %u\n", > trace_buf_record_count_tunable, (1 << 10), > (1 << 20)); > } else { > sc->trace_buf_record_count = (u_int32_t) > trace_buf_record_count_tunable; > } > } > /* Allocate the buffer. */ > trace_buf_bytes = sc->trace_buf_record_count * > sizeof(cam_trace_record_t); > sc->trace_buf = malloc(trace_buf_bytes, M_AD, M_NOWAIT | M_ZERO); > if (sc->trace_buf == NULL) { > device_printf(dev, "unable to allocate %zd for trace_buf\n", > trace_buf_bytes); > sc->trace_buf_record_count = 0; > } else { > device_printf(dev, "allocated %zd @ %p for trace_buf\n", > trace_buf_bytes, sc->trace_buf); > } > /* Request counter, used for indexing */ > sc->trace_buf_request_count = 0; > > > ---------------- > > /* Initial setup of the trace record for a new request. > * > * Locate the circular per-device trace buffer in the drive's softc. > * Populate the pointer to the next record in the trace buffer; this is > * done atomically to prevent concurrent requests from getting the same > * index. That could get a bit weird if/when 'trace_buf_request_count' > * rolls over, if the maximum counter value is not a multiple of > * 'trace_buf_record_count' (i.e. the request counter recycles in the > * middle of the buffer). By using a 64-bit request counter, we're pretty > * safe - even at one transaction per nanosecond, that will still last > * about 585 years. > * > * Initialize the record by zeroing it and then setting the VALID flag. > * > * This could be done at any point before sending the request to the > * hardware; neither the CDB data nor the timestamps are involved yet. > */ > void > trace_rec_init(cam_request_t req) { > ada_softc *sc = (request->...->softc) > u_int64_t tmp = 0; > if (sc && sc->trace_buf) { > tmp = atomic_fetchadd_log(&sc->trace_buf_request_count, 1); > req->trace_rec = &sc->trace_buf[tmp % > sc->trace_buf_record_count]; > bzero(req->trace_rec, sizeof(*req->trace_rec)); > req->trace_rec->flags = (CAM_TRACE_F_PERIPH_XXX << 28); > req->trace_rec->flags |= CAM_TRACE_F_VALID; > } else { > req->trace_rec = NULL; > } > } > > ---------------- > > /* Stuff the request data into the trace record. > * > * If the trace record is present and valid, copy the CDB into the > * record. Set CAM_TRACE_F_REQ_VALID to indicate that the request data is > * valid, and INPROGRESS to show that the request is in-progress. > * > * This should be done just before sending the request to the hardware. > * This should *not* be fused into trace_rec_init(), because > * trace_rec_init() is also used in the retry case, where it is not > * correct to immediately start the timer. > */ > void > trace_rec_req(cam_request_t req) { > struct timeval tv; > if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { > bcopy(req->cdb, req->trace_rec->req_cdb, sizeof(req->cdb)); > microuptime(&tv); > req->trace_rec->req_usec = ((uint64_t) tv->tv_sec * 1000000) > + tv->tv_usec; > req->trace_rec->flags |= > (CAM_TRACE_F_REQ_VALID | CAM_TRACE_F_INPROGRESS); > } > } > > ---------------- > > /* Stuff the response data into the trace record. > * > * If the trace record is present and valid, copy the CDB **as returned by > * the drive** into the record. At this point, we also have the result from > * the command (status, error, (asc? ascq?)), so save them too. Set > * CAM_TRACE_F_RESP_VALID to indicate that the response data is valid. > * Clear INPROGRESS and set COMPLETE instead. > * > * This should be done just after getting the response information from the > * hardware. > */ > void > trace_rec_resp(cam_request_t req) { > struct timeval tv; > if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { > microuptime(&tv); > bcopy(req->cdb, req->trace_rec->resp_cdb, sizeof(req->cdb)); > req->trace_rec->resp_usec = ((uint64_t) tv->tv_sec * > 1000000) + tv->tv_usec; > req->trace_rec->flags |= > (CAM_TRACE_F_RESP_VALID | CAM_TRACE_F_COMPLETE); > req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS; > } > } > > ---------------- > > /* Update the trace record if there was a timeout. > * > * If the trace record is present and valid, set TIMEDOUT to indicate > * that the request associated with the record exceeded its timeout. > * Since this attempt is complete, clear INPROGRESS and set COMPLETE. > * > * This should be done along with other timeout handling. > */ > void > trace_rec_timeout(cam_request_t req) { > if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { > req->trace_rec->flags |= > (CAM_TRACE_F_TIMEDOUT | CAM_TRACE_F_COMPLETE); > req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS; > } > } > > ---------------- > > /* Update the trace record if the request is being retried. > * > * If the trace record is present and valid, set RETRIED to indicate > * that the request associated with the record is going to be retried. > * Since this attempt is complete, clear INPROGRESS and set COMPLETE. > * Then get a new trace record for the new attempt, copy the request > * command data to the new record, and set IS_RETRY to show that this > * is a retry of an earlier request. > * > * This should be done at the time the retry is queued. > */ > void > trace_rec_retry(cam_request_t req) { > if (req->trace_rec && (req->trace_rec->flags & CAM_TRACE_F_VALID)) { > /* First, mark the original request as being requeued. */ > req->trace_rec->flags |= > (CAM_TRACE_F_RETRIED | CAM_TRACE_F_COMPLETE); > req->trace_rec->flags &= ~CAM_TRACE_F_INPROGRESS; > /* Get a new record for the retry. */ > trace_rec_init(req); > /* Mark the new record as being a retry. */ > req->trace_rec->flags |= CAM_TRACE_F_IS_RETRY; > } > } > > ================================ > > I'd appreciate feedback, both on the implementation details and on the idea > as a whole. Some open questions are: > > - My assumption is that each SIM has a structure that associates a PERIPH's > softc with the CDB that's being processed. It might require dereferencing a > chain of pointers, but it should be there in some fashion. > > - The current implementation adds a few ioctls to the old ad(4) disk-device > driver. I would expect to add the new ioctls to da(4) and ada(4); is that the > right thing to do? > > Again, any feedback would be greatly appreciated. I'll be at the Dev/Vendor > Summit next week, if anyone wants to discuss this with me in person. > > Thanks, > > Ravi > _______________________________________________ > freebsd-s...@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscr...@freebsd.org" _______________________________________________ freebsd-geom@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-geom To unsubscribe, send any mail to "freebsd-geom-unsubscr...@freebsd.org"