Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
On Sat, 2005-03-12 at 11:55 -0500, Dave Wysochanski wrote:
> Alex Aizman wrote:
> > This is to announce the Open-iSCSI project: a High-Performance iSCSI
> > Initiator for Linux.
> >
> > MOTIVATION
> > ==
> >
> > Our initial motivations for the project were: (1) implement the right
> > user/kernel split, and (2) design the iSCSI data path for performance.
> > Recently we added (3): get accepted into the mainline kernel.
> >
> > As far as user/kernel, the existing iSCSI initiators bloat the kernel
> > with ever-growing control plane code, including but not limited to:
> > iSCSI discovery, Login (Authentication and Operational), session and
> > connection management, connection-level error processing, iSCSI Text,
> > Nop-Out/In, Async Message, iSNS, SLP, Radius... Open-iSCSI puts the
> > entire control plane in user space. This control plane talks to the
> > data plane via a well defined interface over the netlink transport.
> >
> > (Side note: prior to settling on netlink we considered sysfs, ioctl,
> > and syscall. Because the entire control plane logic resides in user
> > space, we needed a real bi-directional transport that could support an
> > asynchronous API to transfer iSCSI control PDUs: Login, Logout,
> > Nop-In, Nop-Out, Text, Async Message.)
> >
> > Performance.
> > This is the major goal and motivation for this project. As it happens,
> > iSCSI has to compete with Fibre Channel, which is a more entrenched
> > technology in the storage space. In addition, the "soft" iSCSI
> > implementations have to show good results in the presence of
> > specialized hardware offloads.
> >
> > Our performance numbers today are:
> >
> > - 450MB/sec Read on a single connection (2-way 2.4GHz Opteron, 64KB
> > block size);
> >
> > - 320MB/sec Write on a single connection (2-way 2.4GHz Opteron, 64KB
> > block size);
> >
> > - 50,000 Read IOPS on a single connection (2-way 2.4GHz Opteron, 4KB
> > block size).
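[Editorial aside: a quick back-of-the-envelope check of what the quoted numbers imply per command. This helper is illustrative only, not part of Open-iSCSI.]

```python
def cmds_per_sec(throughput_mb_s: float, block_kb: int) -> float:
    """SCSI commands per second implied by a sustained throughput figure."""
    return (throughput_mb_s * 1024.0) / block_kb

# 450MB/sec at 64KB blocks -> 7,200 commands/sec, i.e. a per-command
# budget of roughly 140 microseconds on the initiator data path.
reads = cmds_per_sec(450, 64)

# 50,000 IOPS at 4KB blocks works out to only ~195MB/sec, showing why
# small-block workloads are command-rate bound rather than bandwidth bound.
small_block_mb_s = 50000 * 4 / 1024.0
```
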
> > Has anyone on the list verified these #'s?

As far as I know, no one has tried that but me. We've used disktest with the O_DIRECT flag set, on a 10Gbps network with jumbo frames enabled and a big TCP window & socket buffer. I really would like to see these numbers reproduced on setups other than mine.

> I'm trying to get open-iscsi to work but it looks like it's got a
> problem in the very initial stages of lun scanning that prevents my
> target from working. Open-iscsi guys, I have a trace if you want to
> look at it. Looks like despite the fact that report luns is returned
> successfully and only 1 lun is returned (lun 0), the initiator is still
> sending inquiry commands to luns > 0, and it looks like it gets
> confused when it gets a 0x3f inquiry response from the target (for an
> inquiry to lun 1), tries to issue a TMF abort task on the previous
> inquiry, which has already completed, and the target responds with
> "task not in task set", which is understandable since the command has
> already completed. I used the latest .169 code.

It's too old anyway; try the Subversion repository. But I doubt it will help in your case.

> I don't see this problem with the latest linux-iscsi.sfnet code and
> have interoperated with many other initiators, so I'm fairly confident
> there's a bug in open-iscsi somewhere.

I'm pretty sure it is a bug in open-iscsi. Which target are you using? Can we get remote access?

> > Prior to starting the data path code from scratch we did evaluate the
> > sfnet Initiator, and eventually decided against patching it. Instead,
> > we reused its Discovery, Login, etc. control plane code. Technically,
> > it was the shortest way to achieve the (1) and (2) goals stated
> > above. We believe that it remains the easiest and the most practical
> > thing on the larger scale of: iSCSI for Linux.
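[Editorial aside: the REPORT LUNS response Dave describes starts with a 4-byte LUN list length, so a correct scanner can stop at the LUNs actually reported instead of probing luns > 0. A sketch of parsing such a response — the layout follows the SCSI SAM-2 format; the code is illustrative and taken from neither initiator.]

```python
import struct

def parse_report_luns(data: bytes) -> list:
    """Parse a SCSI REPORT LUNS parameter data buffer into LUN numbers."""
    (list_len,) = struct.unpack_from(">I", data, 0)  # LUN list length in bytes
    luns = []
    for off in range(8, 8 + list_len, 8):            # 8-byte LUN entries
        (lun,) = struct.unpack_from(">Q", data, off)
        # For simple peripheral-device addressing the LUN number lives in
        # the low bits of the first two bytes of the 8-byte field.
        luns.append((lun >> 48) & 0x3FFF)
    return luns

# A target reporting exactly one LUN (LUN 0): scanning should stop here
# rather than sending INQUIRY to lun 1 and tripping over the 0x3f response.
buf = struct.pack(">I", 8) + bytes(4) + bytes(8)
assert parse_report_luns(buf) == [0]
```
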
> > STATUS
> > ==
> >
> > There's 100% working code that interoperates with all (count=5) iSCSI
> > targets we could get our hands on.
> >
> > The software was tested on AMD Opteron (TM) and Intel Xeon (TM).
> >
> > Code is available online via either the Subversion source control
> > database or the latest development release (i.e., the tarball
> > containing Open-iSCSI sources, including user space, that will build
> > and run on kernels starting with 2.6.10).
> >
> > http://www.open-iscsi.org
> >
> > Features:
> >
> > - highly optimized and small-footprint data path;
> > - multiple outstanding R2Ts;
> > - thread-less receive;
> > - sendpage() based transmit;
> > - zero-copy header processing on receive;
> > - no data path memory allocations at runtime;
> > - persistent configuration database;
> > - SendTargets discovery;
> > - CHAP;
> > - DataSequenceInOrder=No;
> > - PDU header Digest;
> > - multiple sessions;
> > - MC/S (note: disabled in the patch);
> > - SCSI-level recovery via Abort Task and session re-open.
> >
> > TODO
> > ==
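[Editorial aside: the netlink transport mentioned in the announcement frames every control PDU in a standard 16-byte nlmsghdr (length, type, flags, sequence, port id). A minimal sketch of packing and unpacking that header from user space — the message type value and payload below are made up for illustration and are not Open-iSCSI's actual interface.]

```python
import struct

NLMSG_HDR = "=IHHII"  # nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid

def nlmsg_pack(msg_type: int, flags: int, seq: int, pid: int, payload: bytes) -> bytes:
    """Prepend a netlink header; nlmsg_len covers header plus payload."""
    hdr_len = struct.calcsize(NLMSG_HDR)
    return struct.pack(NLMSG_HDR, hdr_len + len(payload),
                       msg_type, flags, seq, pid) + payload

def nlmsg_unpack(data: bytes):
    """Return (type, seq, payload) from a netlink datagram."""
    hdr_len = struct.calcsize(NLMSG_HDR)
    nlmsg_len, msg_type, flags, seq, pid = struct.unpack_from(NLMSG_HDR, data, 0)
    return msg_type, seq, data[hdr_len:nlmsg_len]

# Hypothetical "send Login PDU" control message from daemon to kernel.
ISCSI_UEVENT_LOGIN = 42  # made-up type value, for illustration only
wire = nlmsg_pack(ISCSI_UEVENT_LOGIN, 0, seq=1, pid=0, payload=b"login-pdu")
```

Because each message carries a sequence number and the socket is datagram-oriented in both directions, the daemon can keep several control PDUs in flight and match kernel replies back to requests — the asynchronous, bi-directional behavior the side note says ruled out sysfs and ioctl.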
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
Alex Aizman wrote:
> This is to announce the Open-iSCSI project: a High-Performance iSCSI
> Initiator for Linux.
>
> MOTIVATION
> ==
>
> Our initial motivations for the project were: (1) implement the right
> user/kernel split, and (2) design the iSCSI data path for performance.
> Recently we added (3): get accepted into the mainline kernel.
>
> As far as user/kernel, the existing iSCSI initiators bloat the kernel
> with ever-growing control plane code, including but not limited to:
> iSCSI discovery, Login (Authentication and Operational), session and
> connection management, connection-level error processing, iSCSI Text,
> Nop-Out/In, Async Message, iSNS, SLP, Radius... Open-iSCSI puts the
> entire control plane in user space. This control plane talks to the
> data plane via a well defined interface over the netlink transport.
>
> (Side note: prior to settling on netlink we considered sysfs, ioctl,
> and syscall. Because the entire control plane logic resides in user
> space, we needed a real bi-directional transport that could support an
> asynchronous API to transfer iSCSI control PDUs: Login, Logout,
> Nop-In, Nop-Out, Text, Async Message.)
>
> Performance.
> This is the major goal and motivation for this project. As it happens,
> iSCSI has to compete with Fibre Channel, which is a more entrenched
> technology in the storage space. In addition, the "soft" iSCSI
> implementations have to show good results in the presence of
> specialized hardware offloads.
>
> Our performance numbers today are:
>
> - 450MB/sec Read on a single connection (2-way 2.4GHz Opteron, 64KB
> block size);
>
> - 320MB/sec Write on a single connection (2-way 2.4GHz Opteron, 64KB
> block size);
>
> - 50,000 Read IOPS on a single connection (2-way 2.4GHz Opteron, 4KB
> block size).

Has anyone on the list verified these #'s?

I'm trying to get open-iscsi to work but it looks like it's got a problem in the very initial stages of lun scanning that prevents my target from working. Open-iscsi guys, I have a trace if you want to look at it.
Looks like, despite the fact that REPORT LUNS is returned successfully and only 1 lun is returned (lun 0), the initiator is still sending INQUIRY commands to luns > 0. It looks like it gets confused when it gets a 0x3f inquiry response from the target (for an inquiry to lun 1), tries to issue a TMF abort task on the previous inquiry, which has already completed, and the target responds with "task not in task set", which is understandable since the command has already completed. I used the latest .169 code.

I don't see this problem with the latest linux-iscsi.sfnet code and have interoperated with many other initiators, so I'm fairly confident there's a bug in open-iscsi somewhere.

> Prior to starting the data path code from scratch we did evaluate the
> sfnet Initiator, and eventually decided against patching it. Instead,
> we reused its Discovery, Login, etc. control plane code. Technically,
> it was the shortest way to achieve the (1) and (2) goals stated above.
> We believe that it remains the easiest and the most practical thing on
> the larger scale of: iSCSI for Linux.
>
> STATUS
> ==
>
> There's 100% working code that interoperates with all (count=5) iSCSI
> targets we could get our hands on.
>
> The software was tested on AMD Opteron (TM) and Intel Xeon (TM).
>
> Code is available online via either the Subversion source control
> database or the latest development release (i.e., the tarball
> containing Open-iSCSI sources, including user space, that will build
> and run on kernels starting with 2.6.10).
>
> http://www.open-iscsi.org
>
> Features:
>
> - highly optimized and small-footprint data path;
> - multiple outstanding R2Ts;
> - thread-less receive;
> - sendpage() based transmit;
> - zero-copy header processing on receive;
> - no data path memory allocations at runtime;
> - persistent configuration database;
> - SendTargets discovery;
> - CHAP;
> - DataSequenceInOrder=No;
> - PDU header Digest;
> - multiple sessions;
> - MC/S (note: disabled in the patch);
> - SCSI-level recovery via Abort Task and session re-open.
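[Editorial aside: the feature "no data path memory allocations at runtime" is typically achieved with a fixed pool of command descriptors allocated at session setup and recycled through a free list, so the I/O path never has to call into the allocator under memory pressure. A hedged illustration of the idea — not Open-iSCSI's actual data structures.]

```python
class TaskPool:
    """Fixed-size descriptor pool: allocate once at setup, recycle forever."""

    def __init__(self, size: int):
        # Every descriptor (initiator task tag + PDU header buffer) is
        # created up front, before any I/O is issued.
        self._free = [{"itt": i, "pdu": bytearray(48)} for i in range(size)]

    def get(self):
        # Returns None under pressure instead of calling an allocator, so
        # the data path can never block on (or deadlock inside) the VM.
        return self._free.pop() if self._free else None

    def put(self, task):
        self._free.append(task)

pool = TaskPool(128)
task = pool.get()   # take a descriptor for an outgoing command
pool.put(task)      # recycle it on command completion
```
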
> TODO
> ==
>
> The near term plan is: test, test, and test. We need to stabilize the
> existing code; after 5 months of development this seems to be the right
> thing to do. Other short-term plans include:
>
> a) process community feedback, implement comments, and apply patches;
> b) clean up the user side of the iSCSI open interface; use API calls
> (instead of directly constructing events);
> c) eliminate runtime control path memory allocations (for Nop-In,
> Nop-Out, etc.);
> d) implement Write path optimizations (delayed because of the
> self-imposed submission deadline);
> e) OProfile the data path, use the reports for further optimization;
> f) complete the readme.
>
> Comments, code reviews, patches - are greatly appreciated!
>
> THANKS
> ==
>
> Special thanks to our first reviewers: Christoph Hellwig and Mike
> Christie. Special thanks to Ming Zhang for help in testing and for
> insightful questions.
>
> Regards,
> Alex Aizman & Dmitry Yusupov
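[Editorial aside: the feature list above includes CHAP. For reference, the CHAP response carried in the iSCSI Login exchange is just MD5 over the identifier byte, the shared secret, and the challenge, per RFC 1994. A sketch with a made-up secret and challenge.]

```python
import hashlib

def chap_response(chap_id: int, secret: bytes, challenge: bytes) -> bytes:
    """CHAP_R = MD5(id || secret || challenge), per RFC 1994."""
    return hashlib.md5(bytes([chap_id]) + secret + challenge).digest()

# Both sides compute the same 16-byte digest; the target compares it to
# the value the initiator sent in the CHAP_R login key.
resp = chap_response(1, b"example-secret", bytes(16))  # illustrative inputs
assert len(resp) == 16
```
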
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
On Thu, 2005-03-10 at 11:27 +0100, Lars Marowsky-Bree wrote:
> On 2005-03-09T18:36:37, Alex Aizman <[EMAIL PROTECTED]> wrote:
> > > That works well in our current development series, and if you want
> > > to share code, you can either rip it off (Open Source, we love
> > > ya ;) or we can spin off these parts into a sub-package for you to
> > > depend on...
> > If it's not a big deal :-) let's do the "sub-package" option.
>
> I've brought this up on the linux-ha-dev list. When do you need this?

For open-iscsi, I think it would make sense to link the open-iscsi daemon code against klibc, the same way dm-multipath does. This will allow us to build iSCSI remote boot using early user-space. Not sure it will be possible to use your package without modifications. Let me know.

Dmitry
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
On 2005-03-09T18:36:37, Alex Aizman <[EMAIL PROTECTED]> wrote:
> Heartbeat is good for reliability, etc. WRT "getting paged-out" -
> non-deterministic (things depend on time), right?

Right, if we didn't get scheduled often enough for us to send our heartbeat messages to the other peers, they'll evict us from the cluster and fence us, causing a service disruption.

With all these protections in place though, we can run at roughly 50ms heartbeat intervals from user-space, reliably, which allows us a node dead timer of ~200ms. I think that's pretty damn good. (Of course, realistically, even for subsecond fail-over, 200ms keep-alives are sufficient, and 50ms would be quite extreme. But, it works.)

> > That works well in our current development series, and if you want to
> > share code, you can either rip it off (Open Source, we love ya ;) or
> > we can spin off these parts into a sub-package for you to depend on...
> If it's not a big deal :-) let's do the "sub-package" option.

I've brought this up on the linux-ha-dev list. When do you need this?

Sincerely,
    Lars Marowsky-Brée <[EMAIL PROTECTED]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
Lars Marowsky-Bree wrote:
> On 2005-03-08T22:25:29, Alex Aizman <[EMAIL PROTECTED]> wrote:
> > There's (or at least was up until today) an ongoing discussion on our
> > mailing list at http://groups-beta.google.com/group/open-iscsi. The
> > short and long of it: the problem can be solved, and it will be. A
> > couple of simple things we already do: mlockall() to keep the daemon
> > un-swapped; we are also looking into a potential dependency created
> > by syslog (there's one for the 2.4 kernel, not sure if this is an
> > issue for 2.6).
>
> BTW, to get around the very same issues, heartbeat does much the same:
> lock itself into memory, reserve a couple of pages more to spare on
> stack & heap, run at soft-realtime priority.

Heartbeat is good for reliability, etc. WRT "getting paged-out" - non-deterministic (things depend on time), right?

> syslog(), however, sucks.

It does.

> We went down the path of using our non-blocking IPC library to have
> all our various components log to ha_logd, which then logs to syslog()
> or writes to disk or wherever.

Found ha_logd under http://linux-ha.org. The latter is extremely interesting in the longer term. In the short term, there's quite a bit of information on this site; need time.

> That works well in our current development series, and if you want to
> share code, you can either rip it off (Open Source, we love ya ;) or
> we can spin off these parts into a sub-package for you to depend on...

If it's not a big deal :-) let's do the "sub-package" option.

> > The sfnet is a learning experience; it is by no means a proof that
> > it cannot be done.
>
> I'd also argue that it MUST be done, because the current way of "Oh,
> it's somehow related to block stuff, must be in kernel" leads down to
> hell. We better figure out good ways around it ;-)

Yes, it MUST be done.

Alex
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
Bryan Henderson wrote:
> It's fundamental to Unix architecture that user programs sit above the
> kernel and get services from the kernel, and turning that on its head
> so that the kernel depends on a user space program to do something as
> fundamental as a pageout can't come to any good.

Agreed.

> Even if we're able to identify every assumption in Linux design that's
> being broken and patch it up with something like mlockall, tomorrow
> someone will change something, using his vast experience with
> engineering user space programs, and break it again.

In the same vein we can quote Murphy's laws here, like "things get worse under pressure", etc. Specifics, please!

> Maybe a more important point is that to the extent that you make a
> user space process a proxy of the kernel, you lose all those
> advantages that are supposed to come with user space engineering: you
> can now totally hose your system by making a mistake in the code; you
> have to be intimately familiar with kernel internals to get the code
> right, etc. mlockall is meant to provide performance characteristics
> to a user space process. If it can be used to create resource ordering
> correctness, it's an accident. Syslog is one of those things you're
> not supposed to have to think about. If every time you issue a message
> you have to visualize the entire syslog system with all of its options
> to know if you've broken something, the simplicity of user space is
> gone.

The issue with syslog is clear; we are looking into implementation options.

> I would think as a basic design principle that if kernel code ever has
> to wait for a user process to do something,

Open-iSCSI is designed for this to never happen. The user/kernel interface (see include/iscsi_if.h) is non-blocking. Once the transport connection gets established (by the user) and handed over to the kernel (to transport block data), the rest of the data path happens without user interaction. Command-level error processing is done by the kernel part; _no_ dependency here. Connection-level error processing is _delegated_ to the user; the kernel does not block - it just expects the "higher" authority to make the decision wrt the faulty connection.

> it should do it high in a system call, in user context, interruptibly,
> etc. I don't know if it's possible to arrange to have the iSCSI
> initiator waiting there, but I think it's the only way having a user
> space process can do what it's supposed to do.

I also don't know whether it is possible (but I know that if it were possible it would be quite ugly); the main thing however is - this is not what's happening.

> I'd like to remind everyone that user space processes aside, there's
> still a basic iSCSI initiator resource inversion that needs to be
> fixed to avoid deadlock:
>
>   iSCSI initiator driver sits below the memory pool
>     (i.e. a pageout involving iSCSI might be a prerequisite to getting memory)
>   Socket layer sits below iSCSI
>     (iSCSI initiator driver relies on socket services)
>   Memory pool sits below the socket layer
>     (socket layer allocates memory from the main pool)
>
> One might say as long as that's there, close is good enough on the
> user space initiator component.
>
> --
> Bryan Henderson                          San Jose California
> IBM Almaden Research Center              Filesystems

Alex
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
On 2005-03-08T22:25:29, Alex Aizman <[EMAIL PROTECTED]> wrote:
> There's (or at least was up until today) an ongoing discussion on our
> mailing list at http://groups-beta.google.com/group/open-iscsi. The
> short and long of it: the problem can be solved, and it will. Couple
> simple things we already do: mlockall() to keep the daemon un-swapped,
> and also looking into potential dependency created by syslog (there's
> one for 2.4 kernel, not sure if this is an issue for 2.6).

BTW, to get around the very same issues, heartbeat does much the same: lock itself into memory, reserve a couple of pages more to spare on stack & heap, run at soft-realtime priority.

syslog(), however, sucks. We went down the path of using our non-blocking IPC library to have all our various components log to ha_logd, which then logs to syslog() or writes to disk or wherever.

That works well in our current development series, and if you want to share code, you can either rip it off (Open Source, we love ya ;) or we can spin off these parts into a sub-package for you to depend on...

> The sfnet is a learning experience; it is by no means a proof that it
> cannot be done.

I'd also argue that it MUST be done, because the current way of "Oh, it's somehow related to block stuff, must be in kernel" leads down to hell. We better figure out good ways around it ;-)

Sincerely,
    Lars Marowsky-Brée <[EMAIL PROTECTED]>

--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
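[Editorial aside: Lars's recipe (lock memory, pre-reserve stack & heap pages, soft-realtime priority) looks roughly like this from a Linux daemon. mlockall() and sched_setscheduler() need privileges, so this sketch reports what succeeded rather than failing hard; RESERVE_BYTES is an arbitrary illustrative figure, not heartbeat's actual value.]

```python
import ctypes
import os

MCL_CURRENT, MCL_FUTURE = 1, 2     # Linux mlockall(2) flag values
RESERVE_BYTES = 64 * 1024          # arbitrary spare-pages figure

def harden_daemon() -> dict:
    """Apply heartbeat-style protections; report what actually took effect."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Pin current and future pages; needs CAP_IPC_LOCK (or a high
    # RLIMIT_MEMLOCK), so unprivileged runs will see locked == False.
    locked = libc.mlockall(MCL_CURRENT | MCL_FUTURE) == 0

    # Touch a chunk of heap so those pages are resident *before* any
    # low-memory situation develops.
    reserve = bytearray(RESERVE_BYTES)
    for i in range(0, RESERVE_BYTES, 4096):
        reserve[i] = 1

    try:
        # Soft-realtime scheduling, if the kernel permits it.
        os.sched_setscheduler(0, os.SCHED_RR, os.sched_param(1))
        realtime = True
    except (AttributeError, PermissionError, OSError):
        realtime = False
    return {"locked": locked, "reserved": len(reserve), "realtime": realtime}

status = harden_daemon()
```
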
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
On Wed, Mar 09, 2005 at 11:29:03AM -0800, Bryan Henderson wrote:
> I'd like to remind everyone that user space processes aside, there's
> still a basic iSCSI initiator resource inversion that needs to be
> fixed to avoid deadlock:
>
>   iSCSI initiator driver sits below the memory pool
>     (i.e. a pageout involving iSCSI might be a prerequisite to getting memory)
>   Socket layer sits below iSCSI
>     (iSCSI initiator driver relies on socket services)
>   Memory pool sits below the socket layer
>     (socket layer allocates memory from the main pool)
>
> One might say as long as that's there, close is good enough on the
> user space initiator component.

This issue is becoming increasingly pressing and hopefully there'll be a plan for addressing it soon. Unfortunately, none of the proposed solutions help the userspace inversion problem.

--
Mathematics is the supreme nostalgia of our time.
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
> The short and long of it: the problem can be solved, and it will.
> Couple simple things we already do: mlockall() to keep the daemon
> un-swapped, and also looking into potential dependency created by
> syslog (there's one for 2.4 kernel, not sure if this is an issue for
> 2.6).

I think it's probably possible and good to do most of this complex stuff in user space, but it looks to me like it's going down exactly the wrong road. It's fundamental to Unix architecture that user programs sit above the kernel and get services from the kernel, and turning that on its head so that the kernel depends on a user space program to do something as fundamental as a pageout can't come to any good. Even if we're able to identify every assumption in Linux design that's being broken and patch it up with something like mlockall, tomorrow someone will change something, using his vast experience with engineering user space programs, and break it again.

Maybe a more important point is that to the extent that you make a user space process a proxy of the kernel, you lose all those advantages that are supposed to come with user space engineering: you can now totally hose your system by making a mistake in the code; you have to be intimately familiar with kernel internals to get the code right, etc. mlockall is meant to provide performance characteristics to a user space process. If it can be used to create resource ordering correctness, it's an accident. Syslog is one of those things you're not supposed to have to think about. If every time you issue a message you have to visualize the entire syslog system with all of its options to know if you've broken something, the simplicity of user space is gone.

I would think as a basic design principle that if kernel code ever has to wait for a user process to do something, it should do it high in a system call, in user context, interruptibly, etc. I don't know if it's possible to arrange to have the iSCSI initiator waiting there, but I think it's the only way having a user space process can do what it's supposed to do.

I'd like to remind everyone that user space processes aside, there's still a basic iSCSI initiator resource inversion that needs to be fixed to avoid deadlock:

  iSCSI initiator driver sits below the memory pool
    (i.e. a pageout involving iSCSI might be a prerequisite to getting memory)
  Socket layer sits below iSCSI
    (iSCSI initiator driver relies on socket services)
  Memory pool sits below the socket layer
    (socket layer allocates memory from the main pool)

One might say as long as that's there, close is good enough on the user space initiator component.

--
Bryan Henderson                          San Jose California
IBM Almaden Research Center              Filesystems
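[Editorial aside: Bryan's three "sits below" statements form a cycle, which is exactly why the inversion can deadlock under memory pressure. The toy check below just encodes those stated dependencies and walks the loop; it is illustrative only.]

```python
# "A depends on B" edges taken directly from Bryan's ordering argument.
DEPENDS_ON = {
    "memory pool": "iscsi initiator",   # pageout via iSCSI may be needed to reclaim memory
    "iscsi initiator": "socket layer",  # the initiator relies on socket services
    "socket layer": "memory pool",      # sockets allocate from the main pool
}

def find_cycle(start: str) -> list:
    """Follow dependency edges from `start` until a node repeats."""
    path, node = [], start
    while node not in path:
        path.append(node)
        node = DEPENDS_ON[node]
    return path + [node]

cycle = find_cycle("memory pool")
# The walk returns to its starting point: a genuine dependency cycle.
assert cycle[0] == cycle[-1]
```

Breaking any one edge (for example, dedicated reserved memory for the socket path, so sockets never wait on general reclaim) is what removes the deadlock.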
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
On Tue, 2005-03-08 at 22:50 -0800, Matt Mackall wrote:
> On Tue, Mar 08, 2005 at 10:25:58PM -0800, Dmitry Yusupov wrote:
> > As Scott Ferris pointed out, the main reason for deadlock in sfnet
> > was the blocking behavior of the page cache when the daemon tried to
> > do filesystem IO, namely syslog().
>
> That was just one of several problems. And ISTR deciding that
> particular one was quite nasty when we first encountered it though I
> no longer remember the details.

That's bad, since all those details might help us avoid problems and save time in the future daemon design. I would really appreciate it if you could point me to the other potential problems once you recall them.

> > That was on the 2.4.x kernel. We don't know whether it is fixed in
> > 2.6.x. If someone knows, please let us know. Meanwhile we came up
> > with a work-around design in user-space. The "paged out" problem is
> > fixed already in our Subversion repository by utilizing the
> > mlockall() syscall.
>
> I presume this is dynamically linked against glibc?

Over time it will be linked against klibc, as dm-multipath does. It will also help to implement iSCSI boot, when the control plane daemon will be part of the initramfs image.

> > Also we have, IMHO, a working solution for OOM during ERL=0 TCP
> > re-connect.
>
> Care to describe it?

Sure. The idea is to always keep a second reserved/redundant TCP connection per session opened (please note, a TCP connection, not an iSCSI connection). This way, during the recovery cycle, in the case of a sane target the initiator will switch to the redundant TCP connection and send the Login request over it. This could be implemented as a feature and might be disabled via the configuration utility if needed.

Dmitry
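[Editorial aside: Dmitry's reserved-connection idea can be sketched roughly as follows — hold a spare connected socket per session, promote it when the active one dies, and replenish the spare in the background. This is an illustration of the concept, not Open-iSCSI code; the socketpair stands in for real TCP connections to a target.]

```python
import socket

class SessionTransport:
    """Keep a spare connected socket so ERL=0 recovery never has to
    allocate and connect under memory pressure."""

    def __init__(self, connect):
        self._connect = connect      # factory returning a connected socket
        self.active = connect()
        self.spare = connect()       # the reserved/redundant TCP connection

    def recover(self):
        # Promote the pre-established spare, then try to replenish it.
        # The daemon would re-send Login over the promoted connection.
        self.active.close()
        self.active = self.spare
        try:
            self.spare = self._connect()
        except OSError:
            self.spare = None        # degraded: no reserve until a retry
        return self.active

# Demo: loopback socketpairs stand in for connections to the target.
def fake_connect():
    a, b = socket.socketpair()
    return a

sess = SessionTransport(fake_connect)
old_spare = sess.spare
assert sess.recover() is old_spare   # failover reuses the reserved socket
```
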
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
On Tue, Mar 08, 2005 at 10:25:58PM -0800, Dmitry Yusupov wrote:
> As Scott Ferris pointed out, the main reason for deadlock in sfnet was
> the blocking behavior of the page cache when the daemon tried to do
> filesystem IO, namely syslog().

That was just one of several problems. And ISTR deciding that particular one was quite nasty when we first encountered it though I no longer remember the details.

> That was on the 2.4.x kernel. We don't know whether it is fixed in
> 2.6.x. If someone knows, please let us know. Meanwhile we came up with
> a work-around design in user-space. The "paged out" problem is fixed
> already in our Subversion repository by utilizing the mlockall()
> syscall.

I presume this is dynamically linked against glibc?

> Also we have, IMHO, a working solution for OOM during ERL=0 TCP
> re-connect.

Care to describe it?

--
Mathematics is the supreme nostalgia of our time.
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
Matt Mackall wrote:
> On Tue, Mar 08, 2005 at 09:51:39PM -0800, Alex Aizman wrote:
> > Neterion's 10GbE adapters. RAM disk on the target side.
>
> Ahh.
>
> Snipped my question about userspace deadlocks - that was the important
> one. It is in fact why the sfnet one is written as it is - it
> originally had a userspace component and turned out to be easy to
> deadlock under load because of it.

There's (or at least was up until today) an ongoing discussion on our mailing list at http://groups-beta.google.com/group/open-iscsi. The short and long of it: the problem can be solved, and it will be. A couple of simple things we already do: mlockall() to keep the daemon un-swapped; we are also looking into a potential dependency created by syslog (there's one for the 2.4 kernel, not sure if this is an issue for 2.6).

The sfnet is a learning experience; it is by no means a proof that it cannot be done.

Alex
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
On Tue, 2005-03-08 at 22:05 -0800, Matt Mackall wrote:
> Snipped my question about userspace deadlocks - that was the important
> one. It is in fact why the sfnet one is written as it is - it
> originally had a userspace component and turned out to be easy to
> deadlock under load because of it.

As Scott Ferris pointed out, the main reason for deadlock in sfnet was the blocking behavior of the page cache when the daemon tried to do filesystem IO, namely syslog(). That was on the 2.4.x kernel; we don't know whether it is fixed in 2.6.x. If someone knows, please let us know. Meanwhile we came up with a work-around design in user-space. The "paged out" problem is fixed already in our Subversion repository by utilizing the mlockall() syscall.

Also we have, IMHO, a working solution for OOM during ERL=0 TCP re-connect.

Dmitry
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
On Tue, Mar 08, 2005 at 09:51:39PM -0800, Alex Aizman wrote:
> Matt Mackall wrote:
> > How big is the userspace client?
>
> Hmm.. x86 executable? source?
>
> Anyway, there's about 12,000 lines of user space code, and growing. In
> the kernel we have approx. 3,300 lines.
>
> > > - 450MB/sec Read on a single connection (2-way 2.4GHz Opteron, 64KB
> > > block size);
> >
> > With what network hardware and drives, please?
>
> Neterion's 10GbE adapters. RAM disk on the target side.

Ahh.

Snipped my question about userspace deadlocks - that was the important one. It is in fact why the sfnet one is written as it is - it originally had a userspace component and turned out to be easy to deadlock under load because of it.

--
Mathematics is the supreme nostalgia of our time.
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
Matt Mackall wrote:
> How big is the userspace client?

Hmm.. x86 executable? source?

Anyway, there's about 12,000 lines of user space code, and growing. In
the kernel we have approx. 3,300 lines.

> > - 450MB/sec Read on a single connection (2-way 2.4GHz Opteron, 64KB
> > block size);
>
> With what network hardware and drives, please?

Neterion's 10GbE adapters. RAM disk on the target side.

Alex
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
On Sun, Mar 06, 2005 at 11:03:14PM -0800, Alex Aizman wrote:
> As far as user/kernel, the existing iSCSI initiators bloat the kernel with
> ever-growing control plane code, including but not limited to: iSCSI
> discovery, Login (Authentication and Operational), session and connection
> management, connection-level error processing, iSCSI Text, Nop-Out/In,
> Async Message, iSNS, SLP, RADIUS... Open-iSCSI puts the entire control
> plane in user space. This control plane talks to the data plane via a
> well-defined interface over the netlink transport.

How big is the userspace client?

How does this perform under memory pressure? If the userspace iSCSI
client is paged out for whatever reason, and flushing _to_ an iSCSI
device is necessary to page the userspace portion back in, and the
connection needs restarting or the like to flush...

> Performance.
> This is the major goal and motivation for this project. As it happens,
> iSCSI has to compete with Fibre Channel, which is a more entrenched
> technology in the storage space. In addition, a "soft" iSCSI
> implementation has to show good results in the presence of specialized
> hardware offloads.
>
> Our performance numbers today are:
>
> - 450MB/sec Read on a single connection (2-way 2.4GHz Opteron, 64KB block
>   size);

With what network hardware and drives, please?

--
Mathematics is the supreme nostalgia of our time.
[ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
This is to announce the Open-iSCSI project: a High-Performance iSCSI
Initiator for Linux.

MOTIVATION
==

Our initial motivations for the project were: (1) implement the right
user/kernel split, and (2) design the iSCSI data path for performance.
Recently we added (3): get accepted into the mainline kernel.

As far as user/kernel, the existing iSCSI initiators bloat the kernel with
ever-growing control plane code, including but not limited to: iSCSI
discovery, Login (Authentication and Operational), session and connection
management, connection-level error processing, iSCSI Text, Nop-Out/In, Async
Message, iSNS, SLP, RADIUS... Open-iSCSI puts the entire control plane in
user space. This control plane talks to the data plane via a well-defined
interface over the netlink transport.

(Side note: prior to settling on netlink we considered sysfs, ioctl, and
syscall. Because the entire control plane logic resides in user space, we
needed a real bi-directional transport that could support an asynchronous
API to transfer iSCSI control PDUs: Login, Logout, Nop-In, Nop-Out, Text,
Async Message.)

Performance.
This is the major goal and motivation for this project. As it happens, iSCSI
has to compete with Fibre Channel, which is a more entrenched technology in
the storage space. In addition, a "soft" iSCSI implementation has to show
good results in the presence of specialized hardware offloads.

Our performance numbers today are:

- 450MB/sec Read on a single connection (2-way 2.4GHz Opteron, 64KB block
  size);

- 320MB/sec Write on a single connection (2-way 2.4GHz Opteron, 64KB block
  size);

- 50,000 Read IOPS on a single connection (2-way 2.4GHz Opteron, 4KB block
  size).

Prior to writing the data path code from scratch we did evaluate the sfnet
Initiator, and eventually decided against patching it. Instead, we reused
its Discovery, Login, etc. control plane code. Technically, it was the
shortest way to achieve goals (1) and (2) stated above.
We believe it remains the easiest and most practical approach to the larger
goal: iSCSI for Linux.

STATUS
==

There is 100% working code that interoperates with all (count=5) iSCSI
targets we could get our hands on. The software was tested on AMD Opteron
(TM) and Intel Xeon (TM). Code is available online via either the Subversion
source control database or the latest development release (i.e., the tarball
containing the Open-iSCSI sources, including user space, that will build and
run on kernels starting with 2.6.10).

http://www.open-iscsi.org

Features:

- highly optimized and small-footprint data path;
- multiple outstanding R2Ts;
- thread-less receive;
- sendpage() based transmit;
- zero-copy header processing on receive;
- no data path memory allocations at runtime;
- persistent configuration database;
- SendTargets discovery;
- CHAP;
- DataSequenceInOrder=No;
- PDU Header Digest;
- multiple sessions;
- MC/S (note: disabled in the patch);
- SCSI-level recovery via Abort Task and session re-open.

TODO
==

The near-term plan is: test, test, and test. We need to stabilize the
existing code; after 5 months of development this seems to be the right
thing to do. Other short-term plans include:

a) process community feedback, implement comments, and apply patches;
b) clean up the user side of the iSCSI open interface; use API calls
   (instead of directly constructing events);
c) eliminate runtime control path memory allocations (for Nop-In, Nop-Out,
   etc.);
d) implement Write path optimizations (delayed because of the self-imposed
   submission deadline);
e) OProfile the data path, use the reports for further optimization;
f) complete the readme.

Comments, code reviews, patches - all are greatly appreciated!

THANKS
==

Special thanks to our first reviewers: Christoph Hellwig and Mike Christie.
Special thanks to Ming Zhang for help in testing and for insightful
questions.
Regards,
Alex Aizman & Dmitry Yusupov

=

The following 6 patches altogether represent the Open-iSCSI Initiator:

Patch 1: SCSI LLDD, consisting of 3 files:
- iscsi_if.c (iSCSI open interface over netlink);
- iscsi_tcp.[ch] (iSCSI transport over TCP/IP).

Patch 2: Common header files:
- iscsi_if.h (iSCSI open interface over netlink);
- iscsi_proto.h (RFC 3720 #defines and types);
- iscsi_ifev.h (user/kernel events).

Patch 3: drivers/scsi/Kconfig changes.

Patch 4: drivers/scsi/Makefile changes.

Patch 5: include/linux/netlink.h changes (adds the new protocol
NETLINK_ISCSI).

Patch 6: Documentation/scsi/iscsi.txt.
Re: [ANNOUNCE 0/6] Open-iSCSI High-Performance Initiator for Linux
Alex Aizman wrote:
> This is to announce the Open-iSCSI project: High-Performance iSCSI
> Initiator for Linux.
>
> MOTIVATION
> ==
>
> Our initial motivations for the project were: (1) implement the right
> user/kernel split, and (2) design the iSCSI data path for performance.
> Recently we added (3): get accepted into the mainline kernel.
>
> As far as user/kernel, the existing iSCSI initiators bloat the kernel
> with ever-growing control plane code, including but not limited to: iSCSI
> discovery, Login (Authentication and Operational), session and connection
> management,

For iscsi-sfnet, I know it does login and auth and session and connection
management/recovery in the kernel because nobody has been able to write a
userspace daemon that can survive memory allocation failures and being
paged out - are there other problems like this when dealing with userspace?
Open-iscsi seems to suffer from those problems too, but they seem like they
can be fixed relatively quickly. Do you know how long it will take? Is it
still two months, with some of the items I described on the open-iscsi list
in mind, and after looking at what dm multipath has had to do to perform
failback and path checking?

As you know, I agree it should be done in userspace, so please spare me the
usual advertisements and magic I normally get ;) I do not need to be sold
on the concept. I am just trying to get a better picture of whether people
will merge the kernel part with an unreliable userspace component. If so,
then many SourceForge devs can help test, as there is no point in target
vendors on that list fixing the same bugs on multiple stacks like I have
been doing (almost had Pyx fixed with the IBM DS300 too).

If the problems of doing recovery/login/auth in userspace are solved and
well known, should dm multipath move its failover to userspace too? Doing
explicit failover is essentially the same problem, and there is no point in
sticking in a kernel interface for dm hw handlers when it can be done in
userspace.
I mention this because of the MC/S embargo, and the fact that I cannot
imagine many storage admins running iSCSI without some sort of multipath or
failover, so I would like to get that ironed out too.