Pete,

Your proposal for separating the I/O path from the metadata path using existing iSCSI-style protocols sounds very interesting. To clarify my understanding and to spark some discussion along these lines, I have jotted down my thoughts below; please let me know if I have understood your proposal correctly.
- This proposal calls for a split fast-path I/O that will start out optional (and possibly remain optional), since we don't know what the performance implications of this path are at scale. Presumably this can be made a per-file/per-open option..?

- Mount and all metadata operations on non-opened files remain the same, using either the existing client-core model or possibly the FUSE alternative.

- iSCSI target-mode code should be added to the servers so that they can service iSCSI PDUs. This will need a fair amount of tweaking and could possibly leverage Pete's recent OSD work.

- On an open, we upcall to the client-core and fetch the list of data file handles and the BMI addresses of the corresponding servers into the kernel. Assume for simplicity that we only handle a simple stripe distribution (round-robin) across all the servers. When we return to the kernel module, we send an iSCSI login request to each of the data servers backing the striped file. Once that is done, the call returns to the caller. Consequently, every open of a file on PVFS will result in the creation of "n" SCSI initiator endpoints, where "n" is the number of data servers. (I don't know what impact this will have on the scalability of the Linux SCSI stack/iSCSI initiator??)

- Do we need to log in each time? I think login can/should be made a one-time operation per server.

- Any operation involving an already-opened file, such as fstat, read, write, etc., should be mapped to a SCSI command, packetized, and sent over the previously created iSCSI session's connection. Some of the offset calculations etc. would therefore need to be moved to the kernel module.

- Essentially, the bulk of the work involved is in presenting each data file handle on the server as a LUN. Since we do this only for opened files, this shouldn't be that big a scalability issue..??

- What do we respond to a REPORT LUNS command?
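To make the "offset calculations moved to the kernel module" point concrete, here is a minimal sketch of the math the kernel side would need for a simple round-robin stripe distribution: mapping a logical file offset to (server index, offset within that server's datafile). The struct and function names, and the strip size, are illustrative assumptions, not PVFS's actual code.

```c
#include <stdint.h>

/* Hypothetical location of one byte of a striped file:
   which data server (i.e. which iSCSI session / LUN) holds it,
   and at what offset inside that server's datafile. */
struct stripe_loc {
    uint32_t server;       /* index into the "n" data servers */
    uint64_t local_offset; /* byte offset within that server's datafile */
};

/* Round-robin striping: logical strips of strip_size bytes are dealt
   out to servers 0, 1, ..., nservers-1, 0, 1, ... in order. */
static struct stripe_loc map_offset(uint64_t logical_offset,
                                    uint64_t strip_size,
                                    uint32_t nservers)
{
    uint64_t strip = logical_offset / strip_size;  /* global strip number */
    struct stripe_loc loc;
    loc.server = (uint32_t)(strip % nservers);     /* round-robin pick */
    loc.local_offset = (strip / nservers) * strip_size
                     + logical_offset % strip_size;
    return loc;
}
```

A read or write spanning a strip boundary would be split by repeating this calculation per strip and issuing one SCSI command per affected server/session.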
One possibility: if, as part of the open system call implementation, we send an out-of-band PVFS message (scalability...?) telling the servers to mark the corresponding data file handles as eligible LUNs, then we could report all of those LUN IDs. When does the Linux iSCSI initiator stack send a REPORT LUNS, by the way?

At the end of the day, it looks like we will incur a heavy cost on open() to improve the cost of I/O, which is OK if we can do openg()/openfh() type calls.

Did I understand your proposal correctly? Will this work? Thoughts?

thanks,
Murali

_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers