Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On Thu, Feb 07, 2013 at 02:12:57PM -0500, Martin K. Petersen wrote:

Joel == Joel Becker jl...@evilplan.org writes:

Joel> I'm happy to chat about it. Unfortunately, like Darrick says,
Joel> sys_dio() coding hasn't happened. I do think we're better off
Joel> with some kind of explicit API than some magic state on the file.
Joel> I mean, even something like:

Joel> ssize_t write_with_pi(int fd, const void *buf, size_t count,
Joel>                       const void *pi, size_t pi_count);

Joel> It's not as nice as a non-historical API (eg sys_dio), but it also
Joel> probably plays nicer with buffered I/O.

Pretty much everyone I have talked to who is interested in explicitly attaching PI (as opposed to relying on the kernel doing it) is using Linux aio.

I am not opposed to having a more read()/write()-like interface as well. But I think it's important to cater to the I/O paradigm used by the applications interested in this. It's a lot easier to tweak a few IOCB fields than it is to rewrite how an application does I/O.

You know I'm not going to argue with this. I was merely stating that I'm flexible in how we start :-)

Joel

--
Martin K. Petersen, Oracle Linux Engineering

--
Depend on the rabbit's foot if you will, but remember, it didn't help the rabbit.
- R. E. Shay

http://www.jlbec.org/
jl...@evilplan.org

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On Thu, Feb 07, 2013 at 04:04:36PM -0500, J. Bruce Fields wrote:

On Thu, Feb 07, 2013 at 09:36:39AM -0800, Joel Becker wrote:

Dear LSF committee, I'd like to explicitly request attendance for this discussion :-)

http://marc.info/?l=linux-fsdevel&m=135894412908342&w=2

Also, the way I compile the list of requests is from thread heads ... that means don't send your attendee request as a reply to something else either otherwise it might get missed.

Ack. Send as such.

Thanks,
Joel

--b.

Joel

On Thu, Feb 07, 2013 at 09:27:35AM -0800, Zach Brown wrote:

On Thu, Feb 07, 2013 at 11:19:59AM -0500, Jeff Moyer wrote:

Boaz Harrosh bharr...@panasas.com writes:

For aio we just need to add additional fields to an existing structure. So yeah, I'd be interested in that discussion as well.

Sure, it's easy to start there, but then you eventually end up having to add a non-aio interface as well. Let's not take the latter off the table.

I agree that a sync variant shouldn't be ignored, but needing a sync interface with PI arguments also shouldn't get in the way of adding support to the aio+dio path. Simply because it's what people use :/.

I'm not sure how that's directly related to aio, but ok. If we're going to rewrite the aio code, I think Zach's acall would be a good start, at least on the API front:

http://lwn.net/Articles/316806/

Yeah, I'm happy to chat about this stuff if people are interested. I think I'd do things differently today than what was done in that aged acall prototype.

- z

--
You can get more with a kind word and a gun than you can with a kind word alone.
- Al Capone

http://www.jlbec.org/
jl...@evilplan.org

--
You look in her eyes, the music begins to play.
Hopeless romantics, here we go again.

http://www.jlbec.org/
jl...@evilplan.org
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote: On Feb 6, 2013, at 3:24 PM, Darrick J. Wong darrick.w...@oracle.com wrote: On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote: Hi, I'm interested in discussing how to pass protection information to and from userspace. Maybe Martin could be enlisted for the discussion. I read that some work has already been done in this area but have not been able to locate it. It looks like the bio-integrity code already makes it possible to generate the t10-dif crc in the filesystem. It would be good to be able to get the guard and application tags back out to backup applications such as xfsdump. Enabling other applications to generate their own tags in userspace is also interesting. This one's been on my list for a couple of years (and companies) too. A few years ago Joel Becker had support for it in his sys_dio proposal (that hasn't gone anywhere), and more recently I've theorized that we could add a magic fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT *{read,write}v call as the PI buffer, which I think is similar to how DIX gets PI data to a disk. But it's not like I have any code to show for it. I /think/ it's fairly straightforward to change the directio submit code to find the userspace PI buffer and amend the block integrity code to attach our own PI buffer. You'd still have to let the block layer set the sector # field, but afaik that won't affect the crc or the app tag. I hear that the NFS guys want to propose some sort of protocol for transmitting PI data (across NFS), but I haven't seen anything concrete yet. I'm writing a requirements document for the NFS protocol which I can discuss at LSF. The use cases for NFS for now would be virtual disk devices (hypervisors) or direct NFS access to storage from user space. Like everyone else we are waiting for a magical VFS and user space API to appear that can pass PI to and from storage. I'm happy to chat about it. 
Unfortunately, like Darrick says, sys_dio() coding hasn't happened. I do think we're better off with some kind of explicit API than some magic state on the file. I mean, even something like:

ssize_t write_with_pi(int fd, const void *buf, size_t count,
                      const void *pi, size_t pi_count);

It's not as nice as a non-historical API (eg sys_dio), but it also probably plays nicer with buffered I/O.

Joel

Well, I hope I'll scrape together the time to hack together a PoC before LSF... on the other hand, I ran the discussion about PI userland interfaces at LPC2011 and (shamefully) haven't done anything yet.

end rambling

--D

Regards,
Ben

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

--
I think it would be a good idea.
- Mahatma Gandhi, when asked what he thought of Western civilization

http://www.jlbec.org/
jl...@evilplan.org
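As a rough illustration of what "generating tags in userspace" might involve, the sketch below builds 8-byte T10 DIF tuples (16-bit guard CRC, 16-bit application tag, 32-bit reference tag, all big-endian), one per sector, into a buffer of the kind the write_with_pi() prototype takes as pi/pi_count. The names crc16_t10dif() and pi_generate() are made up for this example and do not come from the thread or from any kernel or library API. As noted in the thread, the block layer would probably still own the reference (sector) tag, so the value filled in here is only a placeholder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PI_TUPLE_SIZE 8 /* guard (2) + app tag (2) + ref tag (4) */

/* CRC-16/T10-DIF: polynomial 0x8BB7, initial value 0, no bit reflection. */
uint16_t crc16_t10dif(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(data[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/*
 * Hypothetical helper: write one PI tuple per sector of buf into pi.
 * Returns the number of PI bytes written, or 0 if count is not a whole
 * number of sectors or the pi buffer is too small.
 */
size_t pi_generate(const uint8_t *buf, size_t count, size_t sector_size,
                   uint16_t app_tag, uint32_t first_ref_tag,
                   uint8_t *pi, size_t pi_count)
{
    assert(sector_size > 0);

    if (count % sector_size != 0)
        return 0;

    size_t sectors = count / sector_size;
    if (pi_count < sectors * PI_TUPLE_SIZE)
        return 0;

    for (size_t s = 0; s < sectors; s++) {
        uint16_t guard = crc16_t10dif(buf + s * sector_size, sector_size);
        uint32_t ref = first_ref_tag + (uint32_t)s; /* placeholder value */
        uint8_t *t = pi + s * PI_TUPLE_SIZE;

        t[0] = (uint8_t)(guard >> 8);
        t[1] = (uint8_t)(guard & 0xff);
        t[2] = (uint8_t)(app_tag >> 8);
        t[3] = (uint8_t)(app_tag & 0xff);
        t[4] = (uint8_t)(ref >> 24);
        t[5] = (uint8_t)(ref >> 16);
        t[6] = (uint8_t)(ref >> 8);
        t[7] = (uint8_t)(ref & 0xff);
    }
    return sectors * PI_TUPLE_SIZE;
}
```

A caller would size the pi buffer as (count / sector_size) * 8 bytes and hand both buffers to whatever interface eventually exists. Keeping the PI in a separate buffer, rather than interleaving it with the data, matches the shape of the prototypes floated in this thread.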
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On Thu, Feb 07, 2013 at 01:40:14AM -0800, Joel Becker wrote: On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote: On Feb 6, 2013, at 3:24 PM, Darrick J. Wong darrick.w...@oracle.com wrote: On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote: Hi, I'm interested in discussing how to pass protection information to and from userspace. Maybe Martin could be enlisted for the discussion. I read that some work has already been done in this area but have not been able to locate it. It looks like the bio-integrity code already makes it possible to generate the t10-dif crc in the filesystem. It would be good to be able to get the guard and application tags back out to backup applications such as xfsdump. Enabling other applications to generate their own tags in userspace is also interesting. This one's been on my list for a couple of years (and companies) too. A few years ago Joel Becker had support for it in his sys_dio proposal (that hasn't gone anywhere), and more recently I've theorized that we could add a magic fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT *{read,write}v call as the PI buffer, which I think is similar to how DIX gets PI data to a disk. But it's not like I have any code to show for it. I /think/ it's fairly straightforward to change the directio submit code to find the userspace PI buffer and amend the block integrity code to attach our own PI buffer. You'd still have to let the block layer set the sector # field, but afaik that won't affect the crc or the app tag. I hear that the NFS guys want to propose some sort of protocol for transmitting PI data (across NFS), but I haven't seen anything concrete yet. I'm writing a requirements document for the NFS protocol which I can discuss at LSF. The use cases for NFS for now would be virtual disk devices (hypervisors) or direct NFS access to storage from user space. 
Like everyone else we are waiting for a magical VFS and user space API to appear that can pass PI to and from storage.

I'm happy to chat about it. Unfortunately, like Darrick says, sys_dio() coding hasn't happened. I do think we're better off with some kind of explicit API than some magic state on the file. I mean, even something like:

ssize_t write_with_pi(int fd, const void *buf, size_t count,
                      const void *pi, size_t pi_count);

It's not as nice as a non-historical API (eg sys_dio), but it also probably plays nicer with buffered I/O.

I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio and all the other plumbing necessary to make that happen...

void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt,
                       long long offset, const void *pi, size_t pi_count);

--D

Joel

Well, I hope I'll scrape together the time to hack together a PoC before LSF... on the other hand, I ran the discussion about PI userland interfaces at LPC2011 and (shamefully) haven't done anything yet.

end rambling

--D

Regards,
Ben

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

--
I think it would be a good idea.
- Mahatma Gandhi, when asked what he thought of Western civilization

http://www.jlbec.org/
jl...@evilplan.org
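To make the io_prep_preadv_pi() idea a little more concrete, here is a minimal sketch of what such a prep helper might do: zero the control block, then record the usual preadv arguments plus the PI buffer and its length. Everything here is hypothetical. struct sketch_iocb, SKETCH_CMD_PREADV_PI and sketch_prep_preadv_pi() are stand-ins invented for this example so that it compiles without libaio; they are not libaio's struct iocb, and no such opcode exists. The parameter list follows the prototype proposed above, although a read path would presumably want a writable PI buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical opcode; not a real IO_CMD_* value. */
enum { SKETCH_CMD_PREADV_PI = 100 };

/* Hypothetical control block: the usual preadv fields plus two PI fields. */
struct sketch_iocb {
    int                 fd;
    short               opcode;
    const struct iovec *iov;
    int                 iovcnt;
    long long           offset;
    const void         *pi;       /* protection information buffer */
    size_t              pi_count; /* size of pi in bytes */
};

/* Fill in a control block for a vectored read that also carries PI. */
void sketch_prep_preadv_pi(struct sketch_iocb *iocb, int fd,
                           const struct iovec *iov, int iovcnt,
                           long long offset,
                           const void *pi, size_t pi_count)
{
    assert(iocb != NULL);

    memset(iocb, 0, sizeof(*iocb));
    iocb->fd = fd;
    iocb->opcode = SKETCH_CMD_PREADV_PI;
    iocb->iov = iov;
    iocb->iovcnt = iovcnt;
    iocb->offset = offset;
    iocb->pi = pi;
    iocb->pi_count = pi_count;
}
```

The point of the sketch is only that the caller's submission loop stays the same: an application already filling in control blocks gains two fields, rather than a new I/O model.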
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On 02/07/2013 11:01 AM, Darrick J. Wong wrote: On Thu, Feb 07, 2013 at 01:40:14AM -0800, Joel Becker wrote: On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote: On Feb 6, 2013, at 3:24 PM, Darrick J. Wong darrick.w...@oracle.com wrote: On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote: Hi, I'm interested in discussing how to pass protection information to and from userspace. Maybe Martin could be enlisted for the discussion. I read that some work has already been done in this area but have not been able to locate it. It looks like the bio-integrity code already makes it possible to generate the t10-dif crc in the filesystem. It would be good to be able to get the guard and application tags back out to backup applications such as xfsdump. Enabling other applications to generate their own tags in userspace is also interesting. This one's been on my list for a couple of years (and companies) too. A few years ago Joel Becker had support for it in his sys_dio proposal (that hasn't gone anywhere), and more recently I've theorized that we could add a magic fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT *{read,write}v call as the PI buffer, which I think is similar to how DIX gets PI data to a disk. But it's not like I have any code to show for it. I /think/ it's fairly straightforward to change the directio submit code to find the userspace PI buffer and amend the block integrity code to attach our own PI buffer. You'd still have to let the block layer set the sector # field, but afaik that won't affect the crc or the app tag. I hear that the NFS guys want to propose some sort of protocol for transmitting PI data (across NFS), but I haven't seen anything concrete yet. I'm writing a requirements document for the NFS protocol which I can discuss at LSF. The use cases for NFS for now would be virtual disk devices (hypervisors) or direct NFS access to storage from user space. 
Like everyone else we are waiting for a magical VFS and user space API to appear that can pass PI to and from storage.

I'm happy to chat about it. Unfortunately, like Darrick says, sys_dio() coding hasn't happened. I do think we're better off with some kind of explicit API than some magic state on the file. I mean, even something like:

ssize_t write_with_pi(int fd, const void *buf, size_t count,
                      const void *pi, size_t pi_count);

It's not as nice as a non-historical API (eg sys_dio), but it also probably plays nicer with buffered I/O.

I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio and all the other plumbing necessary to make that happen...

void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt,
                       long long offset, const void *pi, size_t pi_count);

This is also what I've envisioned. Updating io_prep / async I/O is reasonably easy as it's been using a separate structure for passing in the I/O details. Normal read/write calls don't really map as you simply don't have enough parameters to feed PI information into the kernel. So for that you'd need to invent a new interface / syscall. For aio we just need to add additional fields to an existing structure.

So yeah, I'd be interested in that discussion as well.

Cheers,
Hannes

--
Dr. Hannes Reinecke, zSeries Storage
h...@suse.de, +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
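For contrast with the "not enough parameters" problem, Darrick's earlier suggestion quoted in this thread (treat the first iovec of an O_DIRECT readv/writev call as the PI buffer) needs no new arguments at all. A minimal sketch of the userspace side is below, assuming a hypothetical helper named build_pi_iovec(). The "magic fcntl/ioctl" that would tell the kernel to interpret slot 0 this way does not exist, so handing this array to writev() today would simply write the PI bytes as ordinary data.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: lay out an iovec array whose slot 0 is the PI buffer
 * and whose remaining slots are the caller's data segments, in order.
 * Returns the number of slots used, or -1 if out is too small.
 */
int build_pi_iovec(struct iovec *out, int out_cnt,
                   void *pi, size_t pi_len,
                   const struct iovec *data, int data_cnt)
{
    assert(out != NULL && data_cnt >= 0);

    if (out_cnt < data_cnt + 1)
        return -1;

    out[0].iov_base = pi;      /* by convention: protection information */
    out[0].iov_len = pi_len;
    for (int i = 0; i < data_cnt; i++)
        out[i + 1] = data[i];  /* data segments follow unchanged */

    return data_cnt + 1;
}
```

The trade-off discussed in the thread is visible here: the call signature stays historical, but the meaning of the first segment depends on hidden per-file state, which is what the explicit write_with_pi() style avoids.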
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On 02/07/2013 01:27 PM, Hannes Reinecke wrote: On 02/07/2013 11:01 AM, Darrick J. Wong wrote: On Thu, Feb 07, 2013 at 01:40:14AM -0800, Joel Becker wrote: On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote: On Feb 6, 2013, at 3:24 PM, Darrick J. Wong darrick.w...@oracle.com wrote: On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote: Hi, I'm interested in discussing how to pass protection information to and from userspace. Maybe Martin could be enlisted for the discussion. I read that some work has already been done in this area but have not been able to locate it. It looks like the bio-integrity code already makes it possible to generate the t10-dif crc in the filesystem. It would be good to be able to get the guard and application tags back out to backup applications such as xfsdump. Enabling other applications to generate their own tags in userspace is also interesting. This one's been on my list for a couple of years (and companies) too. A few years ago Joel Becker had support for it in his sys_dio proposal (that hasn't gone anywhere), and more recently I've theorized that we could add a magic fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT *{read,write}v call as the PI buffer, which I think is similar to how DIX gets PI data to a disk. But it's not like I have any code to show for it. I /think/ it's fairly straightforward to change the directio submit code to find the userspace PI buffer and amend the block integrity code to attach our own PI buffer. You'd still have to let the block layer set the sector # field, but afaik that won't affect the crc or the app tag. I hear that the NFS guys want to propose some sort of protocol for transmitting PI data (across NFS), but I haven't seen anything concrete yet. I'm writing a requirements document for the NFS protocol which I can discuss at LSF. The use cases for NFS for now would be virtual disk devices (hypervisors) or direct NFS access to storage from user space. 
Like everyone else we are waiting for a magical VFS and user space API to appear that can pass PI to and from storage.

I'm happy to chat about it. Unfortunately, like Darrick says, sys_dio() coding hasn't happened. I do think we're better off with some kind of explicit API than some magic state on the file. I mean, even something like:

ssize_t write_with_pi(int fd, const void *buf, size_t count,
                      const void *pi, size_t pi_count);

It's not as nice as a non-historical API (eg sys_dio), but it also probably plays nicer with buffered I/O.

I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio and all the other plumbing necessary to make that happen...

void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt,
                       long long offset, const void *pi, size_t pi_count);

This is also what I've envisioned. Updating io_prep / async I/O is reasonably easy as its been using a separate structure for passing in the I/O details. Normal read/write calls don't really map as you simply don't have enough parameter to feed PI information into the kernel. So for that you'd need to invent a new interface / syscall. For aio we just need to add additional fields to an existing structure.

So yeah, I'd be interested in that discussion as well.

Me too, on multiple fronts. It's part of my general concern about things we would like for user-mode servers.

I think that the current aio and libaio interface has been broken for a long time, for a multitude of reasons. For instance, the nested structure definitions are COMPAT broken, and there are lots of missing pieces. (For example, search the archives for why bsg does not support sg-lists.) And there are all these additions that everyone wants on top, which call for a new interface anyway.

So I would like to see a deep fixup of this interface, with an aio version 2 that can take into consideration all future needs, including those above. Kernel code will be very happy to be implemented with the new interface, and a COMPAT layer could be put in place for the old interface.

All interested parties should bring to the table the extensions/changes they need, and we can try to union all of them together.

(My addition is for support of sg_lists in bsg, in a way that makes Tomo happy. I know that qemu has wanted this for a while, as have the multitude of user-mode servers.)

Thanks,
Boaz

Cheers,
Hannes
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On 02/07/2013 02:08 PM, Boaz Harrosh wrote: On 02/07/2013 01:27 PM, Hannes Reinecke wrote: On 02/07/2013 11:01 AM, Darrick J. Wong wrote: On Thu, Feb 07, 2013 at 01:40:14AM -0800, Joel Becker wrote: On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote: On Feb 6, 2013, at 3:24 PM, Darrick J. Wong darrick.w...@oracle.com wrote: On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote: Hi, I'm interested in discussing how to pass protection information to and from userspace. Maybe Martin could be enlisted for the discussion. I read that some work has already been done in this area but have not been able to locate it. It looks like the bio-integrity code already makes it possible to generate the t10-dif crc in the filesystem. It would be good to be able to get the guard and application tags back out to backup applications such as xfsdump. Enabling other applications to generate their own tags in userspace is also interesting. This one's been on my list for a couple of years (and companies) too. A few years ago Joel Becker had support for it in his sys_dio proposal (that hasn't gone anywhere), and more recently I've theorized that we could add a magic fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT *{read,write}v call as the PI buffer, which I think is similar to how DIX gets PI data to a disk. But it's not like I have any code to show for it. I /think/ it's fairly straightforward to change the directio submit code to find the userspace PI buffer and amend the block integrity code to attach our own PI buffer. You'd still have to let the block layer set the sector # field, but afaik that won't affect the crc or the app tag. I hear that the NFS guys want to propose some sort of protocol for transmitting PI data (across NFS), but I haven't seen anything concrete yet. I'm writing a requirements document for the NFS protocol which I can discuss at LSF. 
The use cases for NFS for now would be virtual disk devices (hypervisors) or direct NFS access to storage from user space. Like everyone else we are waiting for a magical VFS and user space API to appear that can pass PI to and from storage. I'm happy to chat about it. Unfortunately, like Darrick says, sys_dio() coding hasn't happened. I do think we're better off with some kind of explicit API than some magic state on the file. I mean, even something like: ssize_t write_with_pi(int fd, const void *buf, size_t count, const void *pi, size_t pi_count); It's not as nice as a non-historical API (eg sys_dio), but it also probably plays nicer with buffered I/O. I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio and all the other plumbing necessary to make that happen... void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt, long long offset, const void *pi, size_t pi_count); This is also what I've envisioned. Updating io_prep / async I/O is reasonably easy as its been using a separate structure for passing in the I/O details. Normal read/write calls don't really map as you simply don't have enough parameter to feed PI information into the kernel. So for that you'd need to invent a new interface / syscall. For aio we just need to add additional fields to an existing structure. So yeah, I'd be interested in that discussion as well. Me too, in multiple fronts. It's part of my general concern about things we would like for user-mode servers I think that the current aio and libaio Interface is broken for a long time, for multitude of reasons. For instance the nested structure definitions are COMPAT broken, and lots of missing pieces. (For example search in archives for why bsg does not support sg-lists.) And there are all these additions that everyone wants on top, that call for a new interface anyway. 
So I would like to see a deep fixup of this interface, with an aio version2 that can take into considerations, all of future needs including these above. Kernel code will be very happy to be implemented with the new, interface and a COMPAT layer could be put in place for the old interface. All interested parties should bring to the table what is the extension/changes they need. And we can try and union all of them together. (My addition is for support of sg_lists to bsg, in a way that makes Tomo happy I know that qemu was wanting this for a while as well as the multitude of user-mode servers) I wanted to add that there is another LSF/MM thread going on about: [LSF TOPIC] What to do about O_DIRECT? All these guys should be participating here, so to change core structures and behavior to a better model, that helps us here, and not against us. (Again libaio should be changed in concert with Kernel's new API, and we can sacrifice old user-mode
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On 02/07/13 13:08, Boaz Harrosh wrote:

(My addition is for support of sg_lists to bsg, in a way that makes Tomo happy I know that qemu was wanting this for a while as well as the multitude of user-mode servers)

Do you think it would help / make sense if sg_alloc_table() would be modified such that it allocates the entire scatterlist table via one vmalloc() call instead of chaining several page-sized scatterlist tables? Note: such a change is not possible without modifying scsi_alloc_sgtable().

Bart.
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On 02/07/2013 01:16 PM, Boaz Harrosh wrote: On 02/07/2013 02:08 PM, Boaz Harrosh wrote: On 02/07/2013 01:27 PM, Hannes Reinecke wrote: On 02/07/2013 11:01 AM, Darrick J. Wong wrote: On Thu, Feb 07, 2013 at 01:40:14AM -0800, Joel Becker wrote: On Wed, Feb 06, 2013 at 03:34:49PM -0500, Chuck Lever wrote: On Feb 6, 2013, at 3:24 PM, Darrick J. Wong darrick.w...@oracle.com wrote: On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote: Hi, I'm interested in discussing how to pass protection information to and from userspace. Maybe Martin could be enlisted for the discussion. I read that some work has already been done in this area but have not been able to locate it. It looks like the bio-integrity code already makes it possible to generate the t10-dif crc in the filesystem. It would be good to be able to get the guard and application tags back out to backup applications such as xfsdump. Enabling other applications to generate their own tags in userspace is also interesting. This one's been on my list for a couple of years (and companies) too. A few years ago Joel Becker had support for it in his sys_dio proposal (that hasn't gone anywhere), and more recently I've theorized that we could add a magic fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT *{read,write}v call as the PI buffer, which I think is similar to how DIX gets PI data to a disk. But it's not like I have any code to show for it. I /think/ it's fairly straightforward to change the directio submit code to find the userspace PI buffer and amend the block integrity code to attach our own PI buffer. You'd still have to let the block layer set the sector # field, but afaik that won't affect the crc or the app tag. I hear that the NFS guys want to propose some sort of protocol for transmitting PI data (across NFS), but I haven't seen anything concrete yet. I'm writing a requirements document for the NFS protocol which I can discuss at LSF. 
The use cases for NFS for now would be virtual disk devices (hypervisors) or direct NFS access to storage from user space. Like everyone else we are waiting for a magical VFS and user space API to appear that can pass PI to and from storage. I'm happy to chat about it. Unfortunately, like Darrick says, sys_dio() coding hasn't happened. I do think we're better off with some kind of explicit API than some magic state on the file. I mean, even something like: ssize_t write_with_pi(int fd, const void *buf, size_t count, const void *pi, size_t pi_count); It's not as nice as a non-historical API (eg sys_dio), but it also probably plays nicer with buffered I/O. I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio and all the other plumbing necessary to make that happen... void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt, long long offset, const void *pi, size_t pi_count); This is also what I've envisioned. Updating io_prep / async I/O is reasonably easy as its been using a separate structure for passing in the I/O details. Normal read/write calls don't really map as you simply don't have enough parameter to feed PI information into the kernel. So for that you'd need to invent a new interface / syscall. For aio we just need to add additional fields to an existing structure. So yeah, I'd be interested in that discussion as well. Me too, in multiple fronts. It's part of my general concern about things we would like for user-mode servers I think that the current aio and libaio Interface is broken for a long time, for multitude of reasons. For instance the nested structure definitions are COMPAT broken, and lots of missing pieces. (For example search in archives for why bsg does not support sg-lists.) And there are all these additions that everyone wants on top, that call for a new interface anyway. 
So I would like to see a deep fixup of this interface, with an aio version2 that can take into considerations, all of future needs including these above. Kernel code will be very happy to be implemented with the new, interface and a COMPAT layer could be put in place for the old interface. All interested parties should bring to the table what is the extension/changes they need. And we can try and union all of them together. (My addition is for support of sg_lists to bsg, in a way that makes Tomo happy I know that qemu was wanting this for a while as well as the multitude of user-mode servers) I wanted to add that there is another LSF/MM thread going on about: [LSF TOPIC] What to do about O_DIRECT? All these guys should be participating here, so to change core structures and behavior to a better model, that helps us here, and not against us. (Again libaio should be changed in concert with Kernel's new API, and we can sacrifice old user-mode performance, with a COMPAT layer. Distro
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On 02/07/2013 02:29 PM, Bart Van Assche wrote:

On 02/07/13 13:08, Boaz Harrosh wrote:

(My addition is for support of sg_lists to bsg, in a way that makes Tomo happy I know that qemu was wanting this for a while as well as the multitude of user-mode servers)

Do you think it would help / make sense if sg_alloc_table() would be modified such that it allocates the entire scatterlist table via one vmalloc() call instead of chaining several page-sized scatterlist tables? Note: such a change is not possible without modifying scsi_alloc_sgtable().

I don't think so, no. sg_alloc_table() is used not only for direct I/O but also for buffered I/O. vmalloc() is terribly slow and would be a bottleneck at today's SSD performance. I love it that the Linux Kernel never uses vmalloc internally, and only ever chains everything into objects of up to PAGE_SIZE. Coming from all these other OSs that don't, believe me, it is a great performance pain. (TLBs are a bitch)

Bart.

Thanks,
Boaz
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On 02/07/2013 02:33 PM, Hannes Reinecke wrote:
> On 02/07/2013 01:16 PM, Boaz Harrosh wrote:
>> (Again, libaio should be changed in concert with the kernel's new API, and we can sacrifice old user-mode performance with a COMPAT layer. Distro maintainers should consider replacing libaio together with the new kernel, so it is only those that do their own mix-and-match who can fix that mismatch too.)
>
> And while we're at it, I still would _love_ to connect aio_cancel() and blk_abort_request(). That way we could sensibly abort an I/O and get out of the darn 'D' state.

Yes!! Thanks. It is very interesting how the socket side of the world had it correct for ages, and the same fd object on disks is a second-grade citizen in UNIX land. (Anybody voting for epoll on async disk IO?)

Thanks Hannes, yes, that too. And wait_interuptable() too, in a couple of places; it will need a serious error-handling audit for that.

> Cheers,
> Hannes

Boaz
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
Boaz Harrosh bharr...@panasas.com writes:

>>> I also pondered simply adding a new io_prep_* function + IO_CMD_ code to libaio and all the other plumbing necessary to make that happen...
>>>
>>> void io_prep_preadv_pi(struct iocb *iocb, int fd, const struct iovec *iov,
>>>                        int iovcnt, long long offset, const void *pi,
>>>                        size_t pi_count);
>>
>> This is also what I've envisioned. Updating io_prep / async I/O is reasonably easy as it's been using a separate structure for passing in the I/O details. Normal read/write calls don't really map, as you simply don't have enough parameters to feed PI information into the kernel. So for that you'd need to invent a new interface / syscall.
>>
>> For aio we just need to add additional fields to an existing structure. So yeah, I'd be interested in that discussion as well.

Sure, it's easy to start there, but then you eventually end up having to add a non-aio interface as well. Let's not take the latter off the table.

> Me too, on multiple fronts. It's part of my general concern about things we would like for user-mode servers.
>
> I think that the current aio and libaio interface has been broken for a long time, for a multitude of reasons. For instance the nested structure definitions are COMPAT broken

News to me. I run the libaio test harness built with -m32 on 64 bit regularly. What, exactly, is broken?

> , and lots of missing pieces. (For example, search the archives for why bsg does not support sg-lists.) And there are all these additions that everyone wants on top, that call for a new interface anyway.

What was proposed above does not require a new interface. It's just an additional IO_CMD_*. I'm not saying there aren't reasons for a new interface, it's just I didn't see any in this thread.

> So I would like to see a deep fixup of this interface, with an aio version 2 that can take into consideration all future needs, including those above. Kernel code will be very happy to be implemented with the new interface, and a COMPAT layer could be put in place for the old interface.
>
> All interested parties should bring to the table the extensions/changes they need, and we can try to union all of them together.
>
> (My addition is for support of sg_lists to bsg, in a way that makes Tomo happy. I know that qemu was wanting this for a while, as well as the multitude of user-mode servers.)

I'm not sure how that's directly related to aio, but ok. If we're going to rewrite the aio code, I think Zach's acall would be a good start, at least on the API front:

http://lwn.net/Articles/316806/

Cheers,
Jeff
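For readers who want to see what "just an additional IO_CMD_*" might amount to, here is a minimal sketch of the io_prep_preadv_pi() idea quoted above. Nothing here is real libaio: struct pi_iocb is a hypothetical stand-in for struct iocb with two extra PI fields, and IO_CMD_PREADV_PI is a made-up opcode value. Only the function signature follows the prototype floated in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical opcode; libaio defines no such command. */
enum { IO_CMD_PREADV_PI = 1000 };

/* Hypothetical stand-in for libaio's struct iocb, extended with PI fields. */
struct pi_iocb {
    int                 fd;
    int                 opcode;
    const struct iovec *iov;       /* data buffers */
    int                 iovcnt;
    long long           offset;
    const void         *pi;        /* protection information buffer */
    size_t              pi_count;  /* size of the PI buffer in bytes */
};

/* Same parameter list as the prototype in the thread, applied to the stand-in. */
static void io_prep_preadv_pi(struct pi_iocb *iocb, int fd,
                              const struct iovec *iov, int iovcnt,
                              long long offset,
                              const void *pi, size_t pi_count)
{
    assert(iocb != NULL);
    memset(iocb, 0, sizeof(*iocb));
    iocb->fd = fd;
    iocb->opcode = IO_CMD_PREADV_PI;
    iocb->iov = iov;
    iocb->iovcnt = iovcnt;
    iocb->offset = offset;
    iocb->pi = pi;
    iocb->pi_count = pi_count;
}
```

The point of the sketch is only that the caller's submission loop stays unchanged; the PI buffer rides along as two more fields on a structure the application already fills in.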
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On Thu, Feb 07, 2013 at 11:19:59AM -0500, Jeff Moyer wrote:
> Boaz Harrosh bharr...@panasas.com writes:
>>> For aio we just need to add additional fields to an existing structure. So yeah, I'd be interested in that discussion as well.
>
> Sure, it's easy to start there, but then you eventually end up having to add a non-aio interface as well. Let's not take the latter off the table.

I agree that a sync variant shouldn't be ignored, but needing a sync interface with PI arguments also shouldn't get in the way of adding support to the aio+dio path. Simply because it's what people use :/.

> I'm not sure how that's directly related to aio, but ok. If we're going to rewrite the aio code, I think Zach's acall would be a good start, at least on the API front:
>
> http://lwn.net/Articles/316806/

Yeah, I'm happy to chat about this stuff if people are interested. I think I'd do things differently today than what was done in that aged acall prototype.

- z
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
Dear LSF committee,

I'd like to explicitly request attendance for this discussion :-)

Joel

--
You can get more with a kind word and a gun than you can with a kind word alone.
- Al Capone

http://www.jlbec.org/
jl...@evilplan.org
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
Darrick == Darrick J Wong darrick.w...@oracle.com writes:

Darrick> and more recently I've theorized that we could add a magic fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT *{read,write}v call as the PI buffer, which I think is similar to how DIX gets PI data to a disk. But it's not like I have any code to show for it.

I don't particularly like the "stick it in the first iovec" magic. Also, we need a bit more than this. A handful of knobs need to be present to convey how the PI should be sliced and diced. So then we get into the territory where the first iovec is a PI descriptor of some sort, and the second entry is the PI buffer.

Darrick> I /think/ it's fairly straightforward to change the directio submit code to find the userspace PI buffer and amend the block integrity code to attach our own PI buffer.

I recommend that you check out how I do this in oracleasm.

Darrick> You'd still have to let the block layer set the sector # field, but afaik that won't affect the crc or the app tag.

Correct. But the right way would be to pass the ref tag seed in as part of the IOCB and let sd or the HBA hardware do the remapping.

--
Martin K. Petersen
Oracle Linux Engineering
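To make the "descriptor first, PI buffer second" layout discussed in this message concrete, here is a rough sketch of how a caller might assemble such an iovec array. Everything in it is an assumption for illustration: struct pi_desc, its fields, the PI_CHECK_* flags and pi_build_iov() are hypothetical and correspond to no kernel ABI. The only fixed fact used is that a T10 protection tuple is 8 bytes per protected block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical "knobs" describing how the PI should be interpreted. */
enum { PI_CHECK_GUARD = 1 << 0, PI_CHECK_REF = 1 << 1, PI_CHECK_APP = 1 << 2 };

/* Hypothetical descriptor; no kernel interface defines this layout. */
struct pi_desc {
    uint32_t ref_tag_seed;  /* seed that lower layers would remap */
    uint16_t interval;      /* protected block size in bytes, e.g. 512 */
    uint16_t flags;         /* PI_CHECK_* bits */
};

/*
 * Lay out iov[0] = descriptor, iov[1] = PI buffer, iov[2] = data.
 * Returns the number of iovec entries, or -1 if the sizes do not agree
 * (one 8-byte tuple is expected per protected block).
 */
static int pi_build_iov(struct iovec iov[3], struct pi_desc *desc,
                        void *pi, size_t pi_len,
                        void *data, size_t data_len)
{
    assert(iov != NULL && desc != NULL);
    if (desc->interval == 0 || data_len % desc->interval != 0)
        return -1;
    if (pi_len != data_len / desc->interval * 8)
        return -1;

    iov[0].iov_base = desc;
    iov[0].iov_len = sizeof(*desc);
    iov[1].iov_base = pi;
    iov[1].iov_len = pi_len;
    iov[2].iov_base = data;
    iov[2].iov_len = data_len;
    return 3;
}
```

Carrying the ref tag seed in the descriptor is one way to read the suggestion of passing the seed with the I/O; in the IOCB variant the same fields would simply live in the control block instead of in iov[0].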
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
Joel == Joel Becker jl...@evilplan.org writes:

Joel> I'm happy to chat about it. Unfortunately, like Darrick says, sys_dio() coding hasn't happened. I do think we're better off with some kind of explicit API than some magic state on the file. I mean, even something like:

Joel> ssize_t write_with_pi(int fd, const void *buf, size_t count,
Joel>                       const void *pi, size_t pi_count);

Joel> It's not as nice as a non-historical API (eg sys_dio), but it also probably plays nicer with buffered I/O.

Pretty much everyone I have talked to who is interested in explicitly attaching PI (as opposed to relying on the kernel doing it) is using Linux aio.

I am not opposed to having a more read()/write()-like interface as well. But I think it's important to cater to the I/O paradigm used by the applications interested in this. It's a lot easier to tweak a few IOCB fields than it is to rewrite how an application does I/O.

--
Martin K. Petersen
Oracle Linux Engineering
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
Ben == Ben Myers b...@sgi.com writes:

Ben> I'm interested in discussing how to pass protection information to and from userspace. Maybe Martin could be enlisted for the discussion.

I'll be there, obviously.

Ben> I read that some work has already been done in this area but have not been able to locate it. It looks like the bio-integrity code already makes it possible to generate the t10-dif crc in the filesystem.

Yep. Although the block layer will generate the PI when the filesystem submits the bio. So until we have a userland conduit there hasn't been much point in the filesystems mucking with the PI explicitly.

Ben> It would be good to be able to get the guard and application tags back out to backup applications such as xfsdump. Enabling other applications to generate their own tags in userspace is also interesting.

However, the app tag is really only good for disk drives. Most array vendors use it internally. And going forward we're going to use it for access control instead of opaque storage. So exposing the application tag space to userland applications is of very limited use at this point.

--
Martin K. Petersen
Oracle Linux Engineering
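As background for the guard and application tags mentioned in this message, here is a small sketch of what a userspace application would have to compute if it generated T10 DIF protection information itself. The CRC parameters (polynomial 0x8BB7, initial value 0, no bit reflection, no final XOR) and the 8-byte big-endian tuple layout (guard, application tag, reference tag) are the standard T10 DIF definitions; the function names crc_t10dif() and pi_tuple_fill() are made up for this example and are not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/T10-DIF: poly 0x8BB7, init 0, MSB first, no final XOR. */
static uint16_t crc_t10dif(const uint8_t *data, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x8BB7);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/*
 * Fill one 8-byte protection tuple for a data block:
 *   bytes 0-1: guard tag (CRC of the block), big-endian
 *   bytes 2-3: application tag, big-endian
 *   bytes 4-7: reference tag, big-endian
 */
static void pi_tuple_fill(uint8_t out[8], const uint8_t *block,
                          size_t block_len, uint16_t app_tag,
                          uint32_t ref_tag)
{
    assert(out != NULL && block != NULL);
    uint16_t guard = crc_t10dif(block, block_len);

    out[0] = (uint8_t)(guard >> 8);
    out[1] = (uint8_t)guard;
    out[2] = (uint8_t)(app_tag >> 8);
    out[3] = (uint8_t)app_tag;
    out[4] = (uint8_t)(ref_tag >> 24);
    out[5] = (uint8_t)(ref_tag >> 16);
    out[6] = (uint8_t)(ref_tag >> 8);
    out[7] = (uint8_t)ref_tag;
}
```

A production implementation would use a table-driven or hardware-accelerated CRC; the bit-by-bit loop is used here only because it is easy to check against the definition.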
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On Thu, Feb 07, 2013 at 09:36:39AM -0800, Joel Becker wrote:
> Dear LSF committee,
> I'd like to explicitly request attendance for this discussion :-)

http://marc.info/?l=linux-fsdevel&m=135894412908342&w=2

Also, the way I compile the list of requests is from thread heads ... that means don't send your attendee request as a reply to something else either, otherwise it might get missed.

--b.
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
Darrick == Darrick J Wong darrick.w...@oracle.com writes:

Darrick> Is there a newer one than this?
Darrick> https://oss.oracle.com/projects/oracleasm/files/sources/

UEK2 git.

--
Martin K. Petersen
Oracle Linux Engineering
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On Wed, Feb 06, 2013 at 01:51:22PM -0600, Ben Myers wrote:
> Hi,
>
> I'm interested in discussing how to pass protection information to and from userspace. Maybe Martin could be enlisted for the discussion.
>
> I read that some work has already been done in this area but have not been able to locate it. It looks like the bio-integrity code already makes it possible to generate the t10-dif crc in the filesystem. It would be good to be able to get the guard and application tags back out to backup applications such as xfsdump. Enabling other applications to generate their own tags in userspace is also interesting.

This one's been on my list for a couple of years (and companies) too. A few years ago Joel Becker had support for it in his sys_dio proposal (that hasn't gone anywhere), and more recently I've theorized that we could add a magic fcntl/ioctl to make the kernel recognize, say, the first iovec of a O_DIRECT *{read,write}v call as the PI buffer, which I think is similar to how DIX gets PI data to a disk. But it's not like I have any code to show for it.

I /think/ it's fairly straightforward to change the directio submit code to find the userspace PI buffer and amend the block integrity code to attach our own PI buffer. You'd still have to let the block layer set the sector # field, but afaik that won't affect the crc or the app tag.

I hear that the NFS guys want to propose some sort of protocol for transmitting PI data (across NFS), but I haven't seen anything concrete yet.

Well, I hope I'll scrape together the time to hack together a PoC before LSF... on the other hand, I ran the discussion about PI userland interfaces at LPC2011 and (shamefully) haven't done anything yet.

end rambling

--D

> Regards,
> Ben
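The remark above that setting the sector field "won't affect the crc or the app tag" can be illustrated with a short sketch. It assumes the standard 8-byte T10 tuple layout (2-byte guard, 2-byte application tag, 4-byte big-endian reference tag) and a caller-chosen seed for the reference tags; pi_remap_ref_tags() and its helpers are hypothetical names written for this example, not the block layer's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tuple size and byte offset of the 32-bit big-endian reference tag. */
enum { PI_TUPLE_SIZE = 8, PI_REF_OFFSET = 4 };

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Rewrite each reference tag that equals seed + i to the low 32 bits of
 * phys_lba + i. Guard and application tag bytes are never touched.
 * Returns the number of tuples that were remapped.
 */
static size_t pi_remap_ref_tags(uint8_t *pi, size_t ntuples,
                                uint32_t seed, uint64_t phys_lba)
{
    size_t remapped = 0;

    assert(pi != NULL || ntuples == 0);
    for (size_t i = 0; i < ntuples; i++) {
        uint8_t *ref = pi + i * PI_TUPLE_SIZE + PI_REF_OFFSET;

        if (get_be32(ref) == (uint32_t)(seed + i)) {
            put_be32(ref, (uint32_t)(phys_lba + i));
            remapped++;
        }
    }
    return remapped;
}
```

Because only bytes 4 to 7 of each tuple change, a guard CRC computed in userspace over the data stays valid after the remap.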
Re: [LSF/MM TOPIC][ATTEND] protection information and userspace
On Feb 6, 2013, at 3:24 PM, Darrick J. Wong darrick.w...@oracle.com wrote:

> I hear that the NFS guys want to propose some sort of protocol for transmitting PI data (across NFS), but I haven't seen anything concrete yet.

I'm writing a requirements document for the NFS protocol which I can discuss at LSF. The use cases for NFS for now would be virtual disk devices (hypervisors) or direct NFS access to storage from user space. Like everyone else, we are waiting for a magical VFS and user space API to appear that can pass PI to and from storage.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com