Stefan, Thanks for your review. I feel that iscsi would be beneficial for several usecases.
I have implemented all changes you pointed out, except two, and resent an updated patch to the list. Please see below. regards ronnie sahlberg On Mon, Sep 12, 2011 at 7:14 PM, Stefan Hajnoczi <stefa...@linux.vnet.ibm.com> wrote: > On Sat, Sep 10, 2011 at 02:23:30PM +1000, Ronnie Sahlberg wrote: > > Looking good. I think this is worth merging because it does offer > benefits over host iSCSI. > >> +static void >> +iscsi_abort_task_cb(struct iscsi_context *iscsi, int status, void >> *command_data, >> + void *private_data) >> +{ >> +} >> + >> +static void >> +iscsi_aio_cancel(BlockDriverAIOCB *blockacb) >> +{ >> + IscsiAIOCB *acb = (IscsiAIOCB *)blockacb; >> + IscsiLun *iscsilun = acb->iscsilun; >> + >> + acb->status = -ECANCELED; >> + acb->common.cb(acb->common.opaque, acb->status); >> + acb->canceled = 1; >> + >> + iscsi_task_mgmt_abort_task_async(iscsilun->iscsi, acb->task, >> + iscsi_abort_task_cb, NULL); >> +} > > The asynchronous abort task call looks odd. If a caller allocates a > buffer and issues a read request, then we need to make sure that the > request is really aborted by the time .bdrv_aio_cancel() returns. > > If I understand the code correctly, iscsi_aio_cancel() returns > immediately but the read request will still be in progress. That means > the caller could now free their data buffer and the read request will > overwrite that unallocated memory. You are right. I have added a new function to libiscsi that will also release the task structure from libiscsi as well. So we should no longer have a window where we might have a cancelled task in flight writing the data to an already released buffer once the cancelled data buffers start arriving on the socket. > >> +static void >> +iscsi_aio_write10_cb(struct iscsi_context *iscsi, int status, >> + void *command_data, void *opaque) >> +{ >> + IscsiAIOCB *acb = opaque; >> + >> + trace_iscsi_aio_write10_cb(iscsi, status, acb, acb->canceled); >> + >> + if (acb->buf != NULL) { >> + free(acb->buf); >> + } > > Please just use g_free(acb->buf). g_free(NULL) is defined as a nop so > the check isn't necessary. Also, this code uses free(3) when it shoulde. > use g_free(3). > Done. >> + >> + if (acb->canceled != 0) { >> + qemu_aio_release(acb); >> + scsi_free_scsi_task(acb->task); >> + acb->task = NULL; >> + return; >> + } >> + >> + acb->status = 0; >> + if (status < 0) { >> + error_report("Failed to write10 data to iSCSI lun. %s", >> + iscsi_get_error(iscsi)); >> + acb->status = -EIO; >> + } >> + >> + iscsi_schedule_bh(iscsi_readv_writev_bh_cb, acb); >> + scsi_free_scsi_task(acb->task); >> + acb->task = NULL; >> +} >> + >> +static int64_t sector_qemu2lun(int64_t sector, IscsiLun *iscsilun) >> +{ >> + return sector * BDRV_SECTOR_SIZE / iscsilun->block_size; >> +} >> + >> +static BlockDriverAIOCB * >> +iscsi_aio_writev(BlockDriverState *bs, int64_t sector_num, >> + QEMUIOVector *qiov, int nb_sectors, >> + BlockDriverCompletionFunc *cb, >> + void *opaque) >> +{ >> + IscsiLun *iscsilun = bs->opaque; >> + struct iscsi_context *iscsi = iscsilun->iscsi; >> + IscsiAIOCB *acb; >> + size_t size; >> + int fua = 0; >> + >> + /* set FUA on writes when cache mode is write through */ >> + if (!(bs->open_flags & (BDRV_O_CACHE_WB|BDRV_O_NOCACHE))) { >> + fua = 1; >> + } > > FUA needs to reflect the guest semantics - does this disk have an > enabled write cache? When bs->open_flags has BDRV_O_CACHE_WB, then the > guest knows it needs to send flushes because there is a write cache: > > if (!(bs->open_flags & BDRV_O_CACHE_WB)) { > fua = 1; > } > > BDRV_O_NOCACHE is just for local files and sets the O_DIRECT hint. It > doesn't affect the cache semantics that the guest sees. > Done. When I first started building the patch a while ago, both flags were needed. I missed when the second flag became obsolete upstream. >> +/* >> + * We support iscsi url's on the form >> + * iscsi://[<username>%<password>@]<host>[:<port>]/<targetname>/<lun> >> + */ >> +static int iscsi_open(BlockDriverState *bs, const char *filename, int flags) >> +{ >> + IscsiLun *iscsilun = bs->opaque; >> + struct iscsi_context *iscsi = NULL; >> + struct iscsi_url *iscsi_url = NULL; >> + struct IscsiTask task; >> + int ret; >> + >> + if ((BDRV_SECTOR_SIZE % 512) != 0) { >> + error_report("iSCSI: Invalid BDRV_SECTOR_SIZE. " >> + "BDRV_SECTOR_SIZE(%lld) is not a multiple " >> + "of 512", BDRV_SECTOR_SIZE); >> + return -EINVAL; >> + } > > Another way of saying this is: > QEMU_BUILD_BUG_ON(BDRV_SECTOR_SIZE % 512 != 0); > > The advantage is that the build fails instead of waiting until iscsi is > used at runtime until the failure is detected. > > What will happen if BDRV_SECTOR_SIZE is not a multiple of 512? > I did not do this change. If blocksize is not multiple of 512, the iscsi backend will just refuse to work since having a read-modify-write cycle across the network would be extremely costly. I dont think iscsi should cause the entire build to fail. While it is difficult to imagine a different blocksize, it is not impossible I guess. If someone comes up with a non-512-multiple blocksize and a good reason for one, I dont want to be the guy that broke the build. >> + >> + memset(iscsilun, 0, sizeof(IscsiLun)); >> + >> + /* Should really append the KVM name after the ':' here */ >> + iscsi = iscsi_create_context("iqn.2008-11.org.linux-kvm:"); >> + if (iscsi == NULL) { >> + error_report("iSCSI: Failed to create iSCSI context."); >> + ret = -ENOMEM; >> + goto failed; >> + } >> + >> + iscsi_url = iscsi_parse_full_url(iscsi, filename); >> + if (iscsi_url == NULL) { >> + error_report("Failed to parse URL : %s %s", filename, >> + iscsi_get_error(iscsi)); >> + ret = -ENOMEM; > > -EINVAL? Done. > >> +static BlockDriver bdrv_iscsi = { >> + .format_name = "iscsi", >> + .protocol_name = "iscsi", >> + >> + .instance_size = sizeof(IscsiLun), >> + .bdrv_file_open = iscsi_open, >> + .bdrv_close = iscsi_close, >> + >> + .bdrv_getlength = iscsi_getlength, >> + >> + .bdrv_aio_readv = iscsi_aio_readv, >> + .bdrv_aio_writev = iscsi_aio_writev, >> + .bdrv_aio_flush = iscsi_aio_flush, > > block.c does not emulate .bdrv_flush() using your .bdrv_aio_flush() > implementation. I have sent a patch to add this emulation. This will > turn bdrv_flush(iscsi_bs) into an actual operation instead of a nop :). I did not implement this since I understood from the discussion that a patch is imminent and it is required for existing backend(s?) anyway. > >> diff --git a/trace-events b/trace-events >> index 3fdd60f..4e461e5 100644 >> --- a/trace-events >> +++ b/trace-events >> @@ -494,3 +494,10 @@ escc_sunkbd_event_in(int ch) "Untranslated keycode >> %2.2x" >> escc_sunkbd_event_out(int ch) "Translated keycode %2.2x" >> escc_kbd_command(int val) "Command %d" >> escc_sunmouse_event(int dx, int dy, int buttons_state) "dx=%d dy=%d >> buttons=%01x" >> + >> +# block/iscsi.c >> +disable iscsi_aio_write10_cb(void *iscsi, int status, void *acb, int >> canceled) "iscsi %p status %d acb %p canceled %d" >> +disable iscsi_aio_writev(void *iscsi, int64_t sector_num, int nb_sectors, >> void *opaque, void *acb) "iscsi %p sector_num %"PRId64" nb_sectors %d opaque >> %p acb %p" >> +disable iscsi_aio_read10_cb(void *iscsi, int status, void *acb, int >> canceled) "iscsi %p status %d acb %p canceled %d" >> +disable iscsi_aio_readv(void *iscsi, int64_t sector_num, int nb_sectors, >> void *opaque, void *acb) "iscsi %p sector_num %"PRId64" nb_sectors %d opaque >> %p acb %p" > > It is no longer necessary to prefix trace event definitions with > "disable". Done. > > The stderr and simple backends now start up with all events disabled by > default. The set of events can be set using the -trace > events=<events-file> option or on the QEMU monitor using set trace-event > <name> on. > > Stefan >