subject:"Integration of SCST in the mainstream Linux kernel"

On Fri, 2008-02-08 at 17:42 +0300, Vladislav Bolkhovitin wrote:
> Nicholas A. Bellinger wrote:
> > On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote:
> > 
> >>Is there an open iSCSI Target implementation which does NOT
> >>issue commands to sub-target devices via the SCSI mid-layer, but
> >>bypasses it completely?
> >>
> >>   Luben
> >>
> > 
> > 
> > Hi Luben,
> > 
> > I am guessing you mean futher down the stack, which I don't know this to
> > be the case.  Going futher up the layers is the design of v2.9 LIO-SE.
> > There is a diagram explaining the basic concepts from a 10,000 foot
> > level.
> > 
> > http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf
> > 
> > Note that only traditional iSCSI target is currently implemented in v2.9
> > LIO-SE codebase in the list of target mode fabrics on left side of the
> > layout.  The API between the protocol headers that does
> > encoding/decoding target mode storage packets is probably the least
> > mature area of the LIO stack (because it has always been iSCSI looking
> > towards iSER :).  I don't know who has the most mature API between the
> > storage engine and target storage protocol for doing this between SCST
> > and STGT, I am guessing SCST because of the difference in age of the
> > projects.  Could someone be so kind to fill me in on this..?
> 
> SCST uses scsi_execute_async_fifo() function to submit commands to SCSI 
> devices in the pass-through mode. This function is slightly modified 
> version of scsi_execute_async(), which submits requests in FIFO order 
> instead of LIFO as scsi_execute_async() does (so with 
> scsi_execute_async() they are executed in the reverse order). 
> Scsi_execute_async_fifo() added as a separate patch to the kernel.

The LIO-SE PSCSI Plugin also depends on scsi_execute_async() for builds
on >= 2.6.18.  Note in the core LIO storage engine code (would be
iscsi_target_transport.c), there is no subsystem dependence logic.  The
LIO-SE API is what allows the SE plugins to remain simple and small:

-rw-r--r-- 1 root root 35008 2008-02-02 03:25 iscsi_target_pscsi.c
-rw-r--r-- 1 root root  7537 2008-02-02 17:27 iscsi_target_pscsi.h
-rw-r--r-- 1 root root 18269 2008-02-04 02:23 iscsi_target_iblock.c
-rw-r--r-- 1 root root  6834 2008-02-04 02:25 iscsi_target_iblock.h
-rw-r--r-- 1 root root 30611 2008-02-02 03:25 iscsi_target_file.c
-rw-r--r-- 1 root root  7833 2008-02-02 17:27 iscsi_target_file.h
-rw-r--r-- 1 root root 35154 2008-02-02 04:01 iscsi_target_rd.c
-rw-r--r-- 1 root root  9900 2008-02-02 17:27 iscsi_target_rd.h

It also means that the core LIO-SE code does not have to change when the
subsystem APIs change.  This has been important in the past for the
project, but for upstream code, probably would not make a huge
difference.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Fri, 2008-02-08 at 17:36 +0300, Vladislav Bolkhovitin wrote:
> Nicholas A. Bellinger wrote:
> - It has been discussed which iSCSI target implementation should be in
> the mainstream Linux kernel. There is no agreement on this subject
> yet. The short-term options are as follows:
> 1) Do not integrate any new iSCSI target implementation in the
> mainstream Linux kernel.
> 2) Add one of the existing in-kernel iSCSI target implementations to
> the kernel, e.g. SCST or PyX/LIO.
> 3) Create a new in-kernel iSCSI target implementation that combines
> the advantages of the existing iSCSI kernel target implementations
> (iETD, STGT, SCST and PyX/LIO).
> 
> As an iSCSI user, I prefer option (3). The big question is whether the
> various storage target authors agree with this ?
> >>>
> >>>I tend to agree with some important notes:
> >>>
> >>>1. IET should be excluded from this list, iSCSI-SCST is IET updated for 
> >>>SCST 
> >>>framework with a lot of bugfixes and improvements.
> >>>
> >>>2. I think, everybody will agree that Linux iSCSI target should work over 
> >>>some standard SCSI target framework. Hence the choice gets narrower: SCST 
> >>>vs 
> >>>STGT. I don't think there's a way for a dedicated iSCSI target (i.e. 
> >>>PyX/LIO) 
> >>>in the mainline, because of a lot of code duplication. Nicholas could 
> >>>decide 
> >>>to move to either existing framework (although, frankly, I don't think 
> >>>there's a possibility for in-kernel iSCSI target and user space SCSI 
> >>>target 
> >>>framework) and if he decide to go with SCST, I'll be glad to offer my help 
> >>>and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. 
> >>>The 
> >>>better one should win.
> >>
> >>why should linux as an iSCSI target be limited to passthrough to a SCSI 
> >>device.
> > 
> > 
> > 
> > I don't think anyone is saying it should be.  It makes sense that the
> > more mature SCSI engines that have working code will be providing alot
> > of the foundation as we talk about options..
> > 
> >>From comparing the designs of SCST and LIO-SE, we know that SCST has
> > supports very SCSI specific target mode hardware, including software
> > target mode forks of other kernel code.  This code for the target mode
> > pSCSI, FC and SAS control paths (more for the state machines, that CDB
> > emulation) that will most likely never need to be emulated on non SCSI
> > target engine.
> 
> ...but required for SCSI. So, it must be, anyway.
> 
> > SCST has support for the most SCSI fabric protocols of
> > the group (although it is lacking iSER) while the LIO-SE only supports
> > traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6).  The
> > design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs
> > and data to talk to every potential device in the Linux storage stack on
> > the largest amount of hardware architectures possible.
> > 
> > Most of the iSCSI Initiators I know (including non Linux) do not rely on
> > heavy SCSI task management, and I think this would be a lower priority
> > item to get real SCSI specific recovery in the traditional iSCSI target
> > for users.  Espically things like SCSI target mode queue locking
> > (affectionally called Auto Contingent Allegiance) make no sense for
> > traditional iSCSI or iSER, because CmdSN rules are doing this for us.
> 
> Sorry, it isn't correct. ACA provides possibility to lock commands queue 
> in case of CHECK CONDITION, so allows to keep commands execution order 
> in case of errors. CmdSN keeps commands execution order only in case of 
> success, in case of error the next queued command will be executed 
> immediately after the failed one, although application might require to 
> have all subsequent after the failed one commands aborted. Think about 
> journaled file systems, for instance. Also ACA allows to retry the 
> failed command and then resume the queue.
> 

Fair enough.  The point I was making is that I have never actually seen
an iSCSI Initiator use ACA functionality (I don't believe that the Linux
SCSI Ml implements this), or actually generate a CLEAR_ACA task
management request.

--nab

> Vlad
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Nicholas A. Bellinger wrote:

On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote:


Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?

  Luben




Hi Luben,

I am guessing you mean futher down the stack, which I don't know this to
be the case.  Going futher up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.

http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf

Note that only traditional iSCSI target is currently implemented in v2.9
LIO-SE codebase in the list of target mode fabrics on left side of the
layout.  The API between the protocol headers that does
encoding/decoding target mode storage packets is probably the least
mature area of the LIO stack (because it has always been iSCSI looking
towards iSER :).  I don't know who has the most mature API between the
storage engine and target storage protocol for doing this between SCST
and STGT, I am guessing SCST because of the difference in age of the
projects.  Could someone be so kind to fill me in on this..?


SCST uses scsi_execute_async_fifo() function to submit commands to SCSI 
devices in the pass-through mode. This function is slightly modified 
version of scsi_execute_async(), which submits requests in FIFO order 
instead of LIFO as scsi_execute_async() does (so with 
scsi_execute_async() they are executed in the reverse order). 
Scsi_execute_async_fifo() added as a separate patch to the kernel.



Also note, the storage engine plugin for doing userspace passthrough on
the right is also currently not implemented.  Userspace passthrough in
this context is an target engine I/O that is enforcing max_sector and
sector_size limitiations, and encodes/decodes target storage protocol
packets all out of view of userspace.  The addressing will be completely
different if we are pointing SE target packets at non SCSI target ports
in userspace.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Nicholas A. Bellinger wrote:

- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?


I tend to agree with some important notes:

1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST 
framework with a lot of bugfixes and improvements.


2. I think, everybody will agree that Linux iSCSI target should work over 
some standard SCSI target framework. Hence the choice gets narrower: SCST vs 
STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) 
in the mainline, because of a lot of code duplication. Nicholas could decide 
to move to either existing framework (although, frankly, I don't think 
there's a possibility for in-kernel iSCSI target and user space SCSI target 
framework) and if he decide to go with SCST, I'll be glad to offer my help 
and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The 
better one should win.


why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.




I don't think anyone is saying it should be.  It makes sense that the
more mature SCSI engines that have working code will be providing alot
of the foundation as we talk about options..


From comparing the designs of SCST and LIO-SE, we know that SCST has

supports very SCSI specific target mode hardware, including software
target mode forks of other kernel code.  This code for the target mode
pSCSI, FC and SAS control paths (more for the state machines, that CDB
emulation) that will most likely never need to be emulated on non SCSI
target engine.


...but required for SCSI. So, it must be, anyway.


SCST has support for the most SCSI fabric protocols of
the group (although it is lacking iSER) while the LIO-SE only supports
traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6).  The
design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs
and data to talk to every potential device in the Linux storage stack on
the largest amount of hardware architectures possible.

Most of the iSCSI Initiators I know (including non Linux) do not rely on
heavy SCSI task management, and I think this would be a lower priority
item to get real SCSI specific recovery in the traditional iSCSI target
for users.  Espically things like SCSI target mode queue locking
(affectionally called Auto Contingent Allegiance) make no sense for
traditional iSCSI or iSER, because CmdSN rules are doing this for us.


Sorry, it isn't correct. ACA provides possibility to lock commands queue 
in case of CHECK CONDITION, so allows to keep commands execution order 
in case of errors. CmdSN keeps commands execution order only in case of 
success, in case of error the next queued command will be executed 
immediately after the failed one, although application might require to 
have all subsequent after the failed one commands aborted. Think about 
journaled file systems, for instance. Also ACA allows to retry the 
failed command and then resume the queue.


Vlad
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote:
> Is there an open iSCSI Target implementation which does NOT
> issue commands to sub-target devices via the SCSI mid-layer, but
> bypasses it completely?
> 
>Luben
> 

Hi Luben,

I am guessing you mean futher down the stack, which I don't know this to
be the case.  Going futher up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.

http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf

Note that only traditional iSCSI target is currently implemented in v2.9
LIO-SE codebase in the list of target mode fabrics on left side of the
layout.  The API between the protocol headers that does
encoding/decoding target mode storage packets is probably the least
mature area of the LIO stack (because it has always been iSCSI looking
towards iSER :).  I don't know who has the most mature API between the
storage engine and target storage protocol for doing this between SCST
and STGT, I am guessing SCST because of the difference in age of the
projects.  Could someone be so kind to fill me in on this..?

Also note, the storage engine plugin for doing userspace passthrough on
the right is also currently not implemented.  Userspace passthrough in
this context is an target engine I/O that is enforcing max_sector and
sector_size limitiations, and encodes/decodes target storage protocol
packets all out of view of userspace.  The addressing will be completely
different if we are pointing SE target packets at non SCSI target ports
in userspace.

--nab

> 

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Thu, 2008-02-07 at 14:51 -0800, [EMAIL PROTECTED] wrote:
> On Thu, 7 Feb 2008, Vladislav Bolkhovitin wrote:
> 
> > Bart Van Assche wrote:
> >> - It has been discussed which iSCSI target implementation should be in
> >> the mainstream Linux kernel. There is no agreement on this subject
> >> yet. The short-term options are as follows:
> >> 1) Do not integrate any new iSCSI target implementation in the
> >> mainstream Linux kernel.
> >> 2) Add one of the existing in-kernel iSCSI target implementations to
> >> the kernel, e.g. SCST or PyX/LIO.
> >> 3) Create a new in-kernel iSCSI target implementation that combines
> >> the advantages of the existing iSCSI kernel target implementations
> >> (iETD, STGT, SCST and PyX/LIO).
> >> 
> >> As an iSCSI user, I prefer option (3). The big question is whether the
> >> various storage target authors agree with this ?
> >
> > I tend to agree with some important notes:
> >
> > 1. IET should be excluded from this list, iSCSI-SCST is IET updated for 
> > SCST 
> > framework with a lot of bugfixes and improvements.
> >
> > 2. I think, everybody will agree that Linux iSCSI target should work over 
> > some standard SCSI target framework. Hence the choice gets narrower: SCST 
> > vs 
> > STGT. I don't think there's a way for a dedicated iSCSI target (i.e. 
> > PyX/LIO) 
> > in the mainline, because of a lot of code duplication. Nicholas could 
> > decide 
> > to move to either existing framework (although, frankly, I don't think 
> > there's a possibility for in-kernel iSCSI target and user space SCSI target 
> > framework) and if he decide to go with SCST, I'll be glad to offer my help 
> > and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. 
> > The 
> > better one should win.
> 
> why should linux as an iSCSI target be limited to passthrough to a SCSI 
> device.
> 

I don't think anyone is saying it should be.  It makes sense that the
more mature SCSI engines that have working code will be providing alot
of the foundation as we talk about options..

>From comparing the designs of SCST and LIO-SE, we know that SCST has
supports very SCSI specific target mode hardware, including software
target mode forks of other kernel code.  This code for the target mode
pSCSI, FC and SAS control paths (more for the state machines, that CDB
emulation) that will most likely never need to be emulated on non SCSI
target engine.  SCST has support for the most SCSI fabric protocols of
the group (although it is lacking iSER) while the LIO-SE only supports
traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6).  The
design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs
and data to talk to every potential device in the Linux storage stack on
the largest amount of hardware architectures possible.

Most of the iSCSI Initiators I know (including non Linux) do not rely on
heavy SCSI task management, and I think this would be a lower priority
item to get real SCSI specific recovery in the traditional iSCSI target
for users.  Espically things like SCSI target mode queue locking
(affectionally called Auto Contingent Allegiance) make no sense for
traditional iSCSI or iSER, because CmdSN rules are doing this for us.

> the most common use of this sort of thing that I would see is to load up a 
> bunch of 1TB SATA drives in a commodity PC, run software RAID, and then 
> export the resulting volume to other servers via iSCSI. not a 'real' SCSI 
> device in sight.
> 

I recently moved the last core LIO target machine from a hardware RAID5
to MD RAID6 with struct block_device exported LVM objects via
Linux/iSCSI to PVM and HVM domains, and I have been very happy with the
results.  Being able to export any physical or virtual storage object
from whatever layer makes sense for your particular case.  This applies
to both block and file level access.  For example, making an iSCSI
Initiator and Target run in the most limited in environments places
where NAS (espically userspace server side) would have a really hard
time fitting, has always been a requirement.  You can imagine a system
with a smaller amount of memory (say 32MB) having a difficult time doing
I/O to any amount of NAS clients.

If are talking about memory required to get best performance, using
kernel level DMA ring allocation and submission to a generic target
engine uses a significantly smaller amount of memory, than say
traditional buffered FILEIO.  Going futher up the storage stack with
buffered file IO, regardless of if its block or file level, will always
start to add overhead.  I think that kernel level FILEIO with O_DIRECT
and asyncio would probably help alot in this case for general target
mode usage of MD and LVM block devices.

This is because when we are using PSCSI or IBLOCK to queue I/Os which,
may need be different from the original IO from the initiator/client due
to OS storage subsystem differences and/or physical HBA limitiations for
the layers below block.  The current LIO-SE API e

Re: Integration of SCST in the mainstream Linux kernel


[EMAIL PROTECTED] wrote:

On Thu, 7 Feb 2008, Vladislav Bolkhovitin wrote:


Bart Van Assche wrote:


- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?



I tend to agree with some important notes:

1. IET should be excluded from this list, iSCSI-SCST is IET updated 
for SCST framework with a lot of bugfixes and improvements.


2. I think, everybody will agree that Linux iSCSI target should work 
over some standard SCSI target framework. Hence the choice gets 
narrower: SCST vs STGT. I don't think there's a way for a dedicated 
iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code 
duplication. Nicholas could decide to move to either existing 
framework (although, frankly, I don't think there's a possibility for 
in-kernel iSCSI target and user space SCSI target framework) and if he 
decide to go with SCST, I'll be glad to offer my help and support and 
wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better 
one should win.



why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.


the most common use of this sort of thing that I would see is to load up 
a bunch of 1TB SATA drives in a commodity PC, run software RAID, and 
then export the resulting volume to other servers via iSCSI. not a 
'real' SCSI device in sight.


As far as how good a standard iSCSI is, at this point I don't think it 
really matters. There are too many devices and manufacturers out there 
that implement iSCSI as their storage protocol (from both sides, 
offering storage to other systems, and using external storage). 
Sometimes the best technology doesn't win, but Linux should be 
interoperable with as much as possible and be ready to support the 
winners and the loosers in technology options, for as long as anyone 
chooses to use the old equipment (after all, we support things like 
Arcnet networking, which lost to Ethernet many years ago)


David, your question surprises me a lot. From where have you decided 
that SCST supports only pass-through backstorage? Does the RAM disk, 
which Bart has been using for performance tests, look like a SCSI device?


SCST supports all backstorage types you can imagine and Linux kernel 
supports.



David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Luben Tuikov wrote:

Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?


What do you mean? To call directly low level backstorage SCSI drivers 
queuecommand() routine? What are advantages of it?



   Luben

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-07 Thread david


On Thu, 7 Feb 2008, Vladislav Bolkhovitin wrote:


Bart Van Assche wrote:

- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?


I tend to agree with some important notes:

1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST 
framework with a lot of bugfixes and improvements.


2. I think, everybody will agree that Linux iSCSI target should work over 
some standard SCSI target framework. Hence the choice gets narrower: SCST vs 
STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) 
in the mainline, because of a lot of code duplication. Nicholas could decide 
to move to either existing framework (although, frankly, I don't think 
there's a possibility for in-kernel iSCSI target and user space SCSI target 
framework) and if he decide to go with SCST, I'll be glad to offer my help 
and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The 
better one should win.


why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.


the most common use of this sort of thing that I would see is to load up a 
bunch of 1TB SATA drives in a commodity PC, run software RAID, and then 
export the resulting volume to other servers via iSCSI. not a 'real' SCSI 
device in sight.


As far as how good a standard iSCSI is, at this point I don't think it 
really matters. There are too many devices and manufacturers out there 
that implement iSCSI as their storage protocol (from both sides, offering 
storage to other systems, and using external storage). Sometimes the best 
technology doesn't win, but Linux should be interoperable with as much as 
possible and be ready to support the winners and the loosers in technology 
options, for as long as anyone chooses to use the old equipment (after 
all, we support things like Arcnet networking, which lost to Ethernet many 
years ago)


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-07 Thread Luben Tuikov

Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?

   Luben

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-07 Thread Nicholas A. Bellinger

On Thu, 2008-02-07 at 14:13 +0100, Bart Van Assche wrote: 
> Since the focus of this thread shifted somewhat in the last few
> messages, I'll try to summarize what has been discussed so far:
> - There was a number of participants who joined this discussion
> spontaneously. This suggests that there is considerable interest in
> networked storage and iSCSI.
> - It has been motivated why iSCSI makes sense as a storage protocol
> (compared to ATA over Ethernet and Fibre Channel over Ethernet).
> - The direct I/O performance results for block transfer sizes below 64
> KB are a meaningful benchmark for storage target implementations.
> - It has been discussed whether an iSCSI target should be implemented
> in user space or in kernel space. It is clear now that an
> implementation in the kernel can be made faster than a user space
> implementation 
> (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804).
> Regarding existing implementations, measurements have a.o. shown that
> SCST is faster than STGT (30% with the following setup: iSCSI via
> IPoIB and direct I/O block transfers with a size of 512 bytes).
> - It has been discussed which iSCSI target implementation should be in
> the mainstream Linux kernel. There is no agreement on this subject
> yet. The short-term options are as follows:
> 1) Do not integrate any new iSCSI target implementation in the
> mainstream Linux kernel.
> 2) Add one of the existing in-kernel iSCSI target implementations to
> the kernel, e.g. SCST or PyX/LIO.
> 3) Create a new in-kernel iSCSI target implementation that combines
> the advantages of the existing iSCSI kernel target implementations
> (iETD, STGT, SCST and PyX/LIO).
> 
> As an iSCSI user, I prefer option (3). The big question is whether the
> various storage target authors agree with this ?
> 

I think the other data point here would be that final target design
needs to be as generic as possible.  Generic in the sense that the
engine eventually needs to be able to accept NDB and other ethernet
based target mode storage configurations to an abstracted device object
(struct scsi_device, struct block_device, or struct file) just as it
would for an IP Storage based request.

We know that NDB and *oE will have their own naming and discovery, and
the first set of IO tasks to be completed would be those using
(iscsi_cmd_t->cmd_flags & ICF_SCSI_DATA_SG_IO_CDB) in
iscsi_target_transport.c in the current code.These are single READ_*
and WRITE_* codepaths that perform DMA memory pre-proceessing in v2.9
LIO-SE. 

Also, being able to tell the engine to accelerate to DMA ring operation
(say to underlying struct scsi_device or struct block_device) instead of
fileio in some cases you will see better performance when using hardware
(ie: not a underlying kernel thread queueing IO into block).  But I have
found FILEIO with sendpage with MD to be faster in single threaded tests
than struct block_device.  I am currently using IBLOCK for LVM for core
LIO operation (which actually sits on software MD raid6).  I do this
because using submit_bio() with se_mem_t mapped arrays of struct
scatterlist -> struct bio_vec can handle power failures properly, and
not send back StatSN Acks to the Initiator who thinks that everything
has already made it to disk.  This is the case with doing IO to struct
file in the kernel today without a kernel level O_DIRECT.

Also for proper kernel-level target mode support, using struct file with
O_DIRECT for storage blocks and emulating control path CDBS is one of
the work items.  This can be made generic or obtained from the
underlying storage object (anything that can be exported from LIO
Subsystem TPI) For real hardware (struct scsi_device in just about all
the cases these days).  Last time I looked this was due to
fs/direct-io.c:dio_refill_pages() using get_user_pages()...

For really transport specific CDB and control code, which in good amount
of cases, we are going eventually be expected to emulate in software. 
I really like how STGT breaks this up into per device type code
segments; spc.c sbc.c mmc.c ssc.c smc.c etc.  Having all of these split
out properly is one strong point of STGT IMHO, and really makes learning
things much easier.  Also, being able to queue these IOs into a
userspace and receive a asynchronous response back up the storage stack.
I think this is actually a pretty interesting potential for passing
storage protocol packets into userspace apps and leave the protocol
state machines and recovery paths in the kernel with a generic target
engine.

Also, I know that the SCST folks have put alot of time into getting the
very SCSI hardware specific target mode control modes to work.  I
personally own a bunch of this adapters, and would really like to see
better support for target mode on non iSCSI type adapters with a single
target mode storage engine that abstracts storage subsystems and wire
protocol fabrics.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi"

Re: Integration of SCST in the mainstream Linux kernel

2008-02-07 Thread Vladislav Bolkhovitin


Bart Van Assche wrote:

Since the focus of this thread shifted somewhat in the last few
messages, I'll try to summarize what has been discussed so far:
- There was a number of participants who joined this discussion
spontaneously. This suggests that there is considerable interest in
networked storage and iSCSI.
- It has been motivated why iSCSI makes sense as a storage protocol
(compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64
KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented
in user space or in kernel space. It is clear now that an
implementation in the kernel can be made faster than a user space
implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804).
Regarding existing implementations, measurements have a.o. shown that
SCST is faster than STGT (30% with the following setup: iSCSI via
IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?


I tend to agree with some important notes:

1. IET should be excluded from this list, iSCSI-SCST is IET updated for 
SCST framework with a lot of bugfixes and improvements.


2. I think, everybody will agree that Linux iSCSI target should work 
over some standard SCSI target framework. Hence the choice gets 
narrower: SCST vs STGT. I don't think there's a way for a dedicated 
iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code 
duplication. Nicholas could decide to move to either existing framework 
(although, frankly, I don't think there's a possibility for in-kernel 
iSCSI target and user space SCSI target framework) and if he decide to 
go with SCST, I'll be glad to offer my help and support and wouldn't 
care if LIO-SCST eventually replaced iSCSI-SCST. The better one should win.


Vlad
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-07 Thread Bart Van Assche

Since the focus of this thread shifted somewhat in the last few
messages, I'll try to summarize what has been discussed so far:
- There was a number of participants who joined this discussion
spontaneously. This suggests that there is considerable interest in
networked storage and iSCSI.
- It has been motivated why iSCSI makes sense as a storage protocol
(compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64
KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented
in user space or in kernel space. It is clear now that an
implementation in the kernel can be made faster than a user space
implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804).
Regarding existing implementations, measurements have a.o. shown that
SCST is faster than STGT (30% with the following setup: iSCSI via
IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-06 Thread Vladislav Bolkhovitin


James Bottomley wrote:

On Tue, 2008-02-05 at 21:59 +0300, Vladislav Bolkhovitin wrote:


Hmm, how can one write to an mmaped page and don't touch it?


I meant from user space ... the writes are done inside the kernel.


Sure, the mmap() approach agreed to be unpractical, but could you 
elaborate more on this anyway, please? I'm just curious. Do you think 
about implementing a new syscall, which would put pages with data in the 
mmap'ed area?


No, it has to do with the way invalidation occurs.  When you mmap a
region from a device or file, the kernel places page translations for
that region into your vm_area.  The regions themselves aren't backed
until faulted.  For write (i.e. incoming command to target) you specify
the write flag and send the area off to receive the data.  The gather,
expecting the pages to be overwritten, backs them with pages marked
dirty but doesn't fault in the contents (unless it already exists in the
page cache).  The kernel writes the data to the pages and the dirty
pages go back to the user.  msync() flushes them to the device.

The disadvantage of all this is that the handle for the I/O if you will
is a virtual address in a user process that doesn't actually care to see
the data. non-x86 architectures will do flushes/invalidates on this
address space as the I/O occurs.


I more or less see, thanks. But (1) pages still needs to be mmaped to 
the user space process before the data transmission, i.e. they must be 
zeroed before being mmaped, which isn't much faster, than data copy, and 
(2) I suspect, it would be hard to make it race free, e.g. if another 
process would want to write to the same area simultaneously



However, as Linus has pointed out, this discussion is getting a bit off
topic. 


No, that isn't off topic. We've just proved that there is no good way to 
implement zero-copy cached I/O for STGT. I see the only practical way 
for that, proposed by FUJITA Tomonori some time ago: duplicating Linux 
page cache in the user space. But will you like it?


Well, there's no real evidence that zero copy or lack of it is a problem
yet.


The performance improvement from zero copy can be easily estimated, 
knowing the link throughput and data copy throughput, which are about 
the same for 20Gbps links (I did that few e-mail ago).


Vlad
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-06 Thread Roland Dreier

 > Sorry, but I'm afraid you got this wrong. When the iSER transport is
 > used instead of TCP, all data is sent via RDMA, including unsolicited
 > data. If you have look at the iSER implementation in the Linux kernel
 > (source files under drivers/infiniband/ulp/iser), you will see that
 > all data is transferred via RDMA and not via TCP/IP.

I think the confusion here is caused by a slight misuse of the term
"RDMA".  It is true that all data is always transported over an
InfiniBand connection when iSER is used, but not all such transfers
are one-sided RDMA operations; some data can be transferred using
send/receive operations.

 - R.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-06 Thread Benny Halevy

On Feb. 06, 2008, 14:16 +0200, "Bart Van Assche" <[EMAIL PROTECTED]> wrote:
> On Feb 5, 2008 6:01 PM, Erez Zilber <[EMAIL PROTECTED]> wrote:
>> Using such large values for FirstBurstLength will give you poor
>> performance numbers for WRITE commands (with iSER). FirstBurstLength
>> means how much data should you send as unsolicited data (i.e. without
>> RDMA). It means that your WRITE commands were sent without RDMA.
> 
> Sorry, but I'm afraid you got this wrong. When the iSER transport is
> used instead of TCP, all data is sent via RDMA, including unsolicited
> data. If you have look at the iSER implementation in the Linux kernel
> (source files under drivers/infiniband/ulp/iser), you will see that
> all data is transferred via RDMA and not via TCP/IP.

Regardless of what the current implementation is, the behavior you (Bart)
describe seems to disagree with http://www.ietf.org/rfc/rfc5046.txt.

Benny

> 
> Bart Van Assche.
> -

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-06 Thread Jeff Garzik


Bart Van Assche wrote:

On Feb 5, 2008 6:50 PM, Jeff Garzik <[EMAIL PROTECTED]> wrote:

For remotely accessing data, iSCSI+fs is quite simply more overhead than
a networked fs.  With iSCSI you are doing

local VFS -> local blkdev -> network

whereas a networked filesystem is

local VFS -> network


There are use cases than can be solved better via iSCSI and a
filesystem than via a network filesystem. One such use case is when
deploying a virtual machine whose data is stored on a network server:
in that case there is only one user of the data (so there are no
locking issues) and filesystem and block device each run in another
operating system: the filesystem runs inside the virtual machine and
iSCSI either runs in the hypervisor or in the native OS.


Hence the diskless root fs configuration I referred to in multiple 
emails...  whoopee, you reinvented NFS root with quotas :)


Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-06 Thread Bart Van Assche

On Feb 5, 2008 6:01 PM, Erez Zilber <[EMAIL PROTECTED]> wrote:
>
> Using such large values for FirstBurstLength will give you poor
> performance numbers for WRITE commands (with iSER). FirstBurstLength
> means how much data should you send as unsolicited data (i.e. without
> RDMA). It means that your WRITE commands were sent without RDMA.

Sorry, but I'm afraid you got this wrong. When the iSER transport is
used instead of TCP, all data is sent via RDMA, including unsolicited
data. If you have look at the iSER implementation in the Linux kernel
(source files under drivers/infiniband/ulp/iser), you will see that
all data is transferred via RDMA and not via TCP/IP.

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-06 Thread Bart Van Assche

On Feb 5, 2008 6:50 PM, Jeff Garzik <[EMAIL PROTECTED]> wrote:
> For remotely accessing data, iSCSI+fs is quite simply more overhead than
> a networked fs.  With iSCSI you are doing
>
> local VFS -> local blkdev -> network
>
> whereas a networked filesystem is
>
> local VFS -> network

There are use cases than can be solved better via iSCSI and a
filesystem than via a network filesystem. One such use case is when
deploying a virtual machine whose data is stored on a network server:
in that case there is only one user of the data (so there are no
locking issues) and filesystem and block device each run in another
operating system: the filesystem runs inside the virtual machine and
iSCSI either runs in the hypervisor or in the native OS.

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

On Wed, 2008-02-06 at 10:29 +0900, FUJITA Tomonori wrote:
> On Tue, 05 Feb 2008 18:09:15 +0100
> Matteo Tescione <[EMAIL PROTECTED]> wrote:
> 
> > On 5-02-2008 14:38, "FUJITA Tomonori" <[EMAIL PROTECTED]> wrote:
> > 
> > > On Tue, 05 Feb 2008 08:14:01 +0100
> > > Tomasz Chmielewski <[EMAIL PROTECTED]> wrote:
> > > 
> > >> James Bottomley schrieb:
> > >> 
> > >>> These are both features being independently worked on, are they not?
> > >>> Even if they weren't, the combination of the size of SCST in kernel plus
> > >>> the problem of having to find a migration path for the current STGT
> > >>> users still looks to me to involve the greater amount of work.
> > >> 
> > >> I don't want to be mean, but does anyone actually use STGT in
> > >> production? Seriously?
> > >> 
> > >> In the latest development version of STGT, it's only possible to stop
> > >> the tgtd target daemon using KILL / 9 signal - which also means all
> > >> iSCSI initiator connections are corrupted when tgtd target daemon is
> > >> started again (kernel upgrade, target daemon upgrade, server reboot 
> > >> etc.).
> > > 
> > > I don't know what "iSCSI initiator connections are corrupted"
> > > mean. But if you reboot a server, how can an iSCSI target
> > > implementation keep iSCSI tcp connections?
> > > 
> > > 
> > >> Imagine you have to reboot all your NFS clients when you reboot your NFS
> > >> server. Not only that - your data is probably corrupted, or at least the
> > >> filesystem deserves checking...
> > 

The TCP connection will drop, remember that the TCP connection state for
one side has completely vanished.  Depending on iSCSI/iSER
ErrorRecoveryLevel that is set, this will mean:

1) Session Recovery, ERL=0 - Restarting the entire nexus and all
connections across all of the possible subnets or comm-links.  All
outstanding un-StatSN acknowledged commands will be returned back to the
SCSI subsystem with RETRY status.  Once a single connection has been
reestablished to start the nexus, the CDBs will be resent.

2) Connection Recovery, ERL=2 - CDBs from the failed connection(s) will
be retried (nothing changes in the PDU) to fill the iSCSI CmdSN ordering
gap, or be explictly retried with TMR TASK_REASSIGN for ones already
acknowledged by the ExpCmdSN that are returned to the initiator in
response packets or by way of unsolicited NopINs.

> > Don't know if matters, but in my setup (iscsi on top of drbd+heartbeat)
> > rebooting the primary server doesn't affect my iscsi traffic, SCST correctly
> > manages stop/crash, by sending unit attention to clients on reconnect.
> > Drbd+heartbeat correctly manages those things too.
> > Still from an end-user POV, i was able to reboot/survive a crash only with
> > SCST, IETD still has reconnect problems and STGT are even worst.
> 
> Please tell us on stgt-devel mailing list if you see problems. We will
> try to fix them.
> 

FYI, the LIO code also supports rmmoding iscsi_target_mod while at full
10 Gb/sec speed.  I think it should be a requirement to be able to
control per initiator, per portal group, per LUN, per device, per HBA in
the design without restarting any other objects.

--nab

> Thanks,
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Tue, 2008-02-05 at 16:11 -0800, Nicholas A. Bellinger wrote:
> On Tue, 2008-02-05 at 22:21 +0300, Vladislav Bolkhovitin wrote:
> > Jeff Garzik wrote:
> > >>> iSCSI is way, way too complicated. 
> > >>
> > >> I fully agree. From one side, all that complexity is unavoidable for 
> > >> case of multiple connections per session, but for the regular case of 
> > >> one connection per session it must be a lot simpler.
> > > 
> > > Actually, think about those multiple connections...  we already had to 
> > > implement fast-failover (and load bal) SCSI multi-pathing at a higher 
> > > level.  IMO that portion of the protocol is redundant:   You need the 
> > > same capability elsewhere in the OS _anyway_, if you are to support 
> > > multi-pathing.
> > 
> > I'm thinking about MC/S as about a way to improve performance using 
> > several physical links. There's no other way, except MC/S, to keep 
> > commands processing order in that case. So, it's really valuable 
> > property of iSCSI, although with a limited application.
> > 
> > Vlad
> > 
> 
> Greetings,
> 
> I have always observed the case with LIO SE/iSCSI target mode (as well
> as with other software initiators we can leave out of the discussion for
> now, and congrats to the open/iscsi on folks recent release. :-) that
> execution core hardware thread and inter-nexus per 1 Gb/sec ethernet
> port performance scales up to 4x and 2x core x86_64 very well with
> MC/S).  I have been seeing 450 MB/sec using 2x socket 4x core x86_64 for
> a number of years with MC/S.  Using MC/S on 10 Gb/sec (on PCI-X v2.0
> 266mhz as well, which was the first transport that LIO Target ran on
> that was able to reach handle duplex ~1200 MB/sec with 3 initiators and
> MC/S.  In the point to point 10 GB/sec tests on IBM p404 machines, the
> initiators where able to reach ~910 MB/sec with MC/S.  Open/iSCSI was
> able to go a bit faster (~950 MB/sec) because it uses struct sk_buff
> directly. 
> 
 
Sorry, these where IBM p505 express (not p404, duh) which had a 2x
socket 2x core POWER5 setup.  These along with an IBM X-series machine)
where the only ones available for PCI-X v2.0, and this probably is still
the case. :-)

Also, these numbers where with a ~9000 MTU (I don't recall what the
hardware limit on the 10 Gb/sec switch lwas) doing direct struct iovec
to preallocated struct page mapping for payload on the target side.
This is known as RAMDISK_DR plugin in the LIO-SE.  On the initiator, LTP
disktest and O_DIRECT where used for direct to SCSI block device access.

I can big up this paper if anyone is interested.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

On Tue, 05 Feb 2008 18:09:15 +0100
Matteo Tescione <[EMAIL PROTECTED]> wrote:

> On 5-02-2008 14:38, "FUJITA Tomonori" <[EMAIL PROTECTED]> wrote:
> 
> > On Tue, 05 Feb 2008 08:14:01 +0100
> > Tomasz Chmielewski <[EMAIL PROTECTED]> wrote:
> > 
> >> James Bottomley schrieb:
> >> 
> >>> These are both features being independently worked on, are they not?
> >>> Even if they weren't, the combination of the size of SCST in kernel plus
> >>> the problem of having to find a migration path for the current STGT
> >>> users still looks to me to involve the greater amount of work.
> >> 
> >> I don't want to be mean, but does anyone actually use STGT in
> >> production? Seriously?
> >> 
> >> In the latest development version of STGT, it's only possible to stop
> >> the tgtd target daemon using KILL / 9 signal - which also means all
> >> iSCSI initiator connections are corrupted when tgtd target daemon is
> >> started again (kernel upgrade, target daemon upgrade, server reboot etc.).
> > 
> > I don't know what "iSCSI initiator connections are corrupted"
> > mean. But if you reboot a server, how can an iSCSI target
> > implementation keep iSCSI tcp connections?
> > 
> > 
> >> Imagine you have to reboot all your NFS clients when you reboot your NFS
> >> server. Not only that - your data is probably corrupted, or at least the
> >> filesystem deserves checking...
> 
> Don't know if matters, but in my setup (iscsi on top of drbd+heartbeat)
> rebooting the primary server doesn't affect my iscsi traffic, SCST correctly
> manages stop/crash, by sending unit attention to clients on reconnect.
> Drbd+heartbeat correctly manages those things too.
> Still from an end-user POV, i was able to reboot/survive a crash only with
> SCST, IETD still has reconnect problems and STGT are even worst.

Please tell us on stgt-devel mailing list if you see problems. We will
try to fix them.

Thanks,
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Tue, 2008-02-05 at 16:48 -0800, Nicholas A. Bellinger wrote:
> On Tue, 2008-02-05 at 22:01 +0300, Vladislav Bolkhovitin wrote:
> > Jeff Garzik wrote:
> > > Alan Cox wrote:
> > > 
> > >>>better. So for example, I personally suspect that ATA-over-ethernet is 
> > >>>way 
> > >>>better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
> > >>>low-level, and against those crazy SCSI people to begin with.
> > >>
> > >>Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
> > >>would probably trash iSCSI for latency if nothing else.
> > > 
> > > 
> > > AoE is truly a thing of beauty.  It has a two/three page RFC (say no 
> > > more!).
> > > 
> > > But quite so...  AoE is limited to MTU size, which really hurts.  Can't 
> > > really do tagged queueing, etc.
> > > 
> > > 
> > > iSCSI is way, way too complicated. 
> > 
> > I fully agree. From one side, all that complexity is unavoidable for 
> > case of multiple connections per session, but for the regular case of 
> > one connection per session it must be a lot simpler.
> > 
> > And now think about iSER, which brings iSCSI on the whole new complexity 
> > level ;)
> 
> Actually, the iSER protocol wire protocol itself is quite simple,
> because it builds on iSCSI and IPS fundamentals, and because traditional
> iSCSI's recovery logic for CRC failures (and hence alot of
> acknowledgement sequence PDUs that go missing, etc) and the RDMA
> Capable
> Protocol (RCaP).

this should be:

.. and instead the RDMA Capacle Protocol (RCaP) provides the 32-bit or
greater data integrity.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Tue, 2008-02-05 at 22:01 +0300, Vladislav Bolkhovitin wrote:
> Jeff Garzik wrote:
> > Alan Cox wrote:
> > 
> >>>better. So for example, I personally suspect that ATA-over-ethernet is way 
> >>>better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
> >>>low-level, and against those crazy SCSI people to begin with.
> >>
> >>Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
> >>would probably trash iSCSI for latency if nothing else.
> > 
> > 
> > AoE is truly a thing of beauty.  It has a two/three page RFC (say no more!).
> > 
> > But quite so...  AoE is limited to MTU size, which really hurts.  Can't 
> > really do tagged queueing, etc.
> > 
> > 
> > iSCSI is way, way too complicated. 
> 
> I fully agree. From one side, all that complexity is unavoidable for 
> case of multiple connections per session, but for the regular case of 
> one connection per session it must be a lot simpler.
> 
> And now think about iSER, which brings iSCSI on the whole new complexity 
> level ;)

Actually, the iSER protocol wire protocol itself is quite simple,
because it builds on iSCSI and IPS fundamentals, and because traditional
iSCSI's recovery logic for CRC failures (and hence alot of
acknowledgement sequence PDUs that go missing, etc) and the RDMA Capable
Protocol (RCaP).

The logic that iSER collectively disables is known as within-connection
and within-command recovery (negotiated as ErrorRecoveryLevel=1 on the
wire), RFC-5046 requires that the iSCSI layer that iSER is being enabled
to disable CRC32C checksums and any associated timeouts for ERL=1.

Also, have a look at Appendix A. in the iSER spec.

  A.1. iWARP Message Format for iSER Hello Message ...73
  A.2. iWARP Message Format for iSER HelloReply Message ..74
  A.3. iWARP Message Format for SCSI Read Command PDU 75
  A.4. iWARP Message Format for SCSI Read Data ...76
  A.5. iWARP Message Format for SCSI Write Command PDU ...77
  A.6. iWARP Message Format for RDMA Read Request 78
  A.7. iWARP Message Format for Solicited SCSI Write Data 79
  A.8. iWARP Message Format for SCSI Response PDU 80

This is about as 1/2 as many traditional iSCSI PDUs, that iSER
encapulates.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Tue, 2008-02-05 at 14:12 -0500, Jeff Garzik wrote:
> Vladislav Bolkhovitin wrote:
> > Jeff Garzik wrote:
> >> iSCSI is way, way too complicated. 
> > 
> > I fully agree. From one side, all that complexity is unavoidable for 
> > case of multiple connections per session, but for the regular case of 
> > one connection per session it must be a lot simpler.
> 
> 
> Actually, think about those multiple connections...  we already had to 
> implement fast-failover (and load bal) SCSI multi-pathing at a higher 
> level.  IMO that portion of the protocol is redundant:   You need the 
> same capability elsewhere in the OS _anyway_, if you are to support 
> multi-pathing.
> 
>   Jeff
> 
> 

Hey Jeff,

I put a whitepaper on the LIO cluster recently about this topic.. It is
from a few years ago but the datapoints are very relevant.

http://linux-iscsi.org/builds/user/nab/Inter.vs.OuterNexus.Multiplexing.pdf

The key advantage to MC/S and ERL=2 has always been that they are
completely OS independent.  They are designed to work together and
actually benefit from one another.

They are also are protocol independent between Traditional iSCSI and
iSER.

--nab

PS: A great thanks for my former colleague Edward Cheng for putting this
together.

> 
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Tue, 2008-02-05 at 22:21 +0300, Vladislav Bolkhovitin wrote:
> Jeff Garzik wrote:
> >>> iSCSI is way, way too complicated. 
> >>
> >> I fully agree. From one side, all that complexity is unavoidable for 
> >> case of multiple connections per session, but for the regular case of 
> >> one connection per session it must be a lot simpler.
> > 
> > Actually, think about those multiple connections...  we already had to 
> > implement fast-failover (and load bal) SCSI multi-pathing at a higher 
> > level.  IMO that portion of the protocol is redundant:   You need the 
> > same capability elsewhere in the OS _anyway_, if you are to support 
> > multi-pathing.
> 
> I'm thinking about MC/S as about a way to improve performance using 
> several physical links. There's no other way, except MC/S, to keep 
> commands processing order in that case. So, it's really valuable 
> property of iSCSI, although with a limited application.
> 
> Vlad
> 

Greetings,

I have always observed the case with LIO SE/iSCSI target mode (as well
as with other software initiators we can leave out of the discussion for
now, and congrats to the open/iscsi on folks recent release. :-) that
execution core hardware thread and inter-nexus per 1 Gb/sec ethernet
port performance scales up to 4x and 2x core x86_64 very well with
MC/S).  I have been seeing 450 MB/sec using 2x socket 4x core x86_64 for
a number of years with MC/S.  Using MC/S on 10 Gb/sec (on PCI-X v2.0
266mhz as well, which was the first transport that LIO Target ran on
that was able to reach handle duplex ~1200 MB/sec with 3 initiators and
MC/S.  In the point to point 10 GB/sec tests on IBM p404 machines, the
initiators where able to reach ~910 MB/sec with MC/S.  Open/iSCSI was
able to go a bit faster (~950 MB/sec) because it uses struct sk_buff
directly. 

A good rule to keep in mind here while considering performance is that
context switching overhead and pipeline <-> bus stalling (along with
other legacy OS specific storage stack limitations with BLOCK and VFS
with O_DIRECT, et al and I will leave out of the discussion for iSCSI
and SE engine target mode) is that a initiator will scale roughly 1/2 as
well as a target, given comparable hardware and virsh output.  The
software target case target case also depends, in great regard in many
cases, if we are talking about something something as simple as doing
contiguous DMA memory allocations in from a SINGLE kernel thread, and
handling direction execution to a storage hardware DMA ring that may
have not been allocated in the current kernel thread.  In MC/S mode this
breaks down to:

1) Sorting logic that handles pre execution statemachine for transport
from local RDMA memory and OS specific data buffers.   TCP application
data buffer, struct sk_buff, or RDMA struct page or SG.  This should be
generic between iSCSI and iSER.

2) Allocation of said memory buffers to OS subsystem dependent code that
can be queued up to these drivers.  It breaks down to what you can get
drivers and OS subsystem folks to agree to implement, and can be made
generic in a Transport / BLOCK / VFS layered storage stack.  In the
"allocate thread DMA ring and use OS supported software and vendor
available hardware" I don't think the kernel space requirement will
every completely be able to go away.

Without diving into RFC-3720 specifics, the statemachine for MC/S side
for memory allocation, login and logout generic to iSCSi and ISER, and
ERL=2 recovery.  My plan is to post the locations in the LIO code where
this has been implemented, and where we where can make this easier, etc.
In the early in the development of what eventually became LIO Target
code, ERL was broken into separete files and separete function
prefixes. 

iscsi_target_erl0, iscsi_target_erl1 and iscsi_target_erl2.

The statemachine for ERL=0 and ERL=2 is pretty simple in RFC-3720 (have
a look for those interested in the discussion)

7.1.1.  State Descriptions for Initiators and Targets

The LIO target code is also pretty simple for this:

[EMAIL PROTECTED] target]# wc -l iscsi_target_erl*
  1115 iscsi_target_erl0.c
45 iscsi_target_erl0.h
   526 iscsi_target_erl0.o
  1426 iscsi_target_erl1.c
51 iscsi_target_erl1.h
  1253 iscsi_target_erl1.o
   605 iscsi_target_erl2.c
45 iscsi_target_erl2.h
   447 iscsi_target_erl2.o
  5513 total

erl1.c is a bit larger than the others because it contains the MC/S
statemachine functions. iscsi_target_erl1.c:iscsi_execute_cmd() and
iscsi_target_util.c:iscsi_check_received_cmdsn() do most of the work for
LIO MC/S state machine.  I would  probably benefit from being in broken
up into say iscsi_target_mcs.c.  Note that all of this code is MC/S
safe, with the exception of the specific SCSI TMR functions.  For the
SCSI TMR pieces, I have always hoped to use SCST code for doing this...

Most of the login/logout code is done in iscsi_target.c, which is could
probably also benefit fot getting broken out...

--nab

-
To unsubscribe from this list: send the l

Re: Integration of SCST in the mainstream Linux kernel


Jeff Garzik wrote:
iSCSI is way, way too complicated. 


I fully agree. From one side, all that complexity is unavoidable for 
case of multiple connections per session, but for the regular case of 
one connection per session it must be a lot simpler.


Actually, think about those multiple connections...  we already had to 
implement fast-failover (and load bal) SCSI multi-pathing at a higher 
level.  IMO that portion of the protocol is redundant:   You need the 
same capability elsewhere in the OS _anyway_, if you are to support 
multi-pathing.


I'm thinking about MC/S as about a way to improve performance using 
several physical links. There's no other way, except MC/S, to keep 
commands processing order in that case. So, it's really valuable 
property of iSCSI, although with a limited application.


Vlad
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread James Bottomley

On Tue, 2008-02-05 at 21:59 +0300, Vladislav Bolkhovitin wrote:
> >>Hmm, how can one write to an mmaped page and don't touch it?
> > 
> > I meant from user space ... the writes are done inside the kernel.
> 
> Sure, the mmap() approach agreed to be unpractical, but could you 
> elaborate more on this anyway, please? I'm just curious. Do you think 
> about implementing a new syscall, which would put pages with data in the 
> mmap'ed area?

No, it has to do with the way invalidation occurs.  When you mmap a
region from a device or file, the kernel places page translations for
that region into your vm_area.  The regions themselves aren't backed
until faulted.  For write (i.e. incoming command to target) you specify
the write flag and send the area off to receive the data.  The gather,
expecting the pages to be overwritten, backs them with pages marked
dirty but doesn't fault in the contents (unless it already exists in the
page cache).  The kernel writes the data to the pages and the dirty
pages go back to the user.  msync() flushes them to the device.

The disadvantage of all this is that the handle for the I/O if you will
is a virtual address in a user process that doesn't actually care to see
the data. non-x86 architectures will do flushes/invalidates on this
address space as the I/O occurs.

> > However, as Linus has pointed out, this discussion is getting a bit off
> > topic. 
> 
> No, that isn't off topic. We've just proved that there is no good way to 
> implement zero-copy cached I/O for STGT. I see the only practical way 
> for that, proposed by FUJITA Tomonori some time ago: duplicating Linux 
> page cache in the user space. But will you like it?

Well, there's no real evidence that zero copy or lack of it is a problem
yet.

James

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Jeff Garzik


Vladislav Bolkhovitin wrote:

Jeff Garzik wrote:
iSCSI is way, way too complicated. 


I fully agree. From one side, all that complexity is unavoidable for 
case of multiple connections per session, but for the regular case of 
one connection per session it must be a lot simpler.



Actually, think about those multiple connections...  we already had to 
implement fast-failover (and load bal) SCSI multi-pathing at a higher 
level.  IMO that portion of the protocol is redundant:   You need the 
same capability elsewhere in the OS _anyway_, if you are to support 
multi-pathing.


Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

Erez Zilber wrote:

Bart Van Assche wrote:

As you probably know there is a trend in enterprise computing towards
networked storage. This is illustrated by the emergence during the
past few years of standards like SRP (SCSI RDMA Protocol), iSCSI
(Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different
pieces of software are necessary to make networked storage possible:
initiator software and target software. As far as I know there exist
three different SCSI target implementations for Linux:
- The iSCSI Enterprise Target Daemon (IETD,
http://iscsitarget.sourceforge.net/);
- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
- The Generic SCSI Target Middle Level for Linux project (SCST,
http://scst.sourceforge.net/).
Since I was wondering which SCSI target software would be best suited
for an InfiniBand network, I started evaluating the STGT and SCST SCSI
target implementations. Apparently the performance difference between
STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks,
but the SCST target software outperforms the STGT software on an
InfiniBand network. See also the following thread for the details:
http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel.

Sorry for the late response (but better late than never).

One may claim that STGT should have lower performance than SCST because
its data path is from userspace. However, your results show that for
non-IB transports, they both show the same numbers. Furthermore, with IB
there shouldn't be any additional difference between the 2 targets
because data transfer from userspace is as efficient as data transfer
from kernel space.

And now consider if one target has zero-copy cached I/O. How much that
will improve its performance?

The only explanation that I see is that fine tuning for iSCSI & iSER is
required. As was already mentioned in this thread, with SDR you can get
~900 MB/sec with iSER (on STGT).

Erez

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Bart Van Assche

On Feb 5, 2008 6:10 PM, Erez Zilber <[EMAIL PROTECTED]> wrote:
> One may claim that STGT should have lower performance than SCST because
> its data path is from userspace. However, your results show that for
> non-IB transports, they both show the same numbers. Furthermore, with IB
> there shouldn't be any additional difference between the 2 targets
> because data transfer from userspace is as efficient as data transfer
> from kernel space.
>
> The only explanation that I see is that fine tuning for iSCSI & iSER is
> required. As was already mentioned in this thread, with SDR you can get
> ~900 MB/sec with iSER (on STGT).

My most recent measurements also show that one can get 900 MB/s with
STGT + iSER on an SDR IB network, but only for very large block sizes
(>= 100 MB). A quote from Linus Torvalds is relevant here (February 5,
2008):

Block transfer sizes over about 64kB are totally irrelevant for
99% of all people.

Please read my e-mail (posted earlier today) with a comparison for 4
KB - 64 KB block transfer sizes between SCST and STGT.

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Jeff Garzik wrote:

Alan Cox wrote:

better. So for example, I personally suspect that ATA-over-ethernet is way 
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
low-level, and against those crazy SCSI people to begin with.


Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
would probably trash iSCSI for latency if nothing else.



AoE is truly a thing of beauty.  It has a two/three page RFC (say no more!).

But quite so...  AoE is limited to MTU size, which really hurts.  Can't 
really do tagged queueing, etc.



iSCSI is way, way too complicated. 


I fully agree. From one side, all that complexity is unavoidable for 
case of multiple connections per session, but for the regular case of 
one connection per session it must be a lot simpler.


And now think about iSER, which brings iSCSI on the whole new complexity 
level ;)


It's an Internet protocol designed 
by storage designers, what do you expect?


For years I have been hoping that someone will invent a simple protocol 
(w/ strong auth) that can transit ATA and SCSI commands and responses. 
Heck, it would be almost trivial if the kernel had a TLS/SSL implementation.


Jeff

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Linus Torvalds wrote:

I'd assumed the move was primarily because of the difficulty of getting
correct semantics on a shared filesystem



.. not even shared. It was hard to get correct semantics full stop. 

Which is a traditional problem. The thing is, the kernel always has some 
internal state, and it's hard to expose all the semantics that the kernel 
knows about to user space.


So no, performance is not the only reason to move to kernel space. It can 
easily be things like needing direct access to internal data queues (for a 
iSCSI target, this could be things like barriers or just tagged commands - 
yes, you can probably emulate things like that without access to the 
actual IO queues, but are you sure the semantics will be entirely right?


The kernel/userland boundary is not just a performance boundary, it's an 
abstraction boundary too, and these kinds of protocols tend to break 
abstractions. NFS broke it by having "file handles" (which is not 
something that really exists in user space, and is almost impossible to 
emulate correctly), and I bet the same thing happens when emulating a SCSI 
target in user space.


Yes, there is something like that for SCSI target as well. It's a "local 
initiator" or "local nexus", see 
http://thread.gmane.org/gmane.linux.scsi/31288 and 
http://news.gmane.org/find-root.php?message_id=%3c463F36AC.3010207%40vlnb.net%3e 
for more info about that.


In fact, existence of local nexus is one more point why SCST is better, 
than STGT, because for STGT it's pretty hard to support it (all locally 
generated commands would have to be passed through its daemon, which 
would be a total disaster for performance), while for SCST it can be 
done relatively simply.


Vlad
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Linus Torvalds wrote:
So just going by what has happened in the past, I'd assume that iSCSI 
would eventually turn into "connecting/authentication in user space" with 
"data transfers in kernel space".


This is exactly how iSCSI-SCST (iSCSI target driver for SCST) is 
implemented, credits to IET and Ardis target developers.


Vlad
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


James Bottomley wrote:

On Mon, 2008-02-04 at 21:38 +0300, Vladislav Bolkhovitin wrote:


James Bottomley wrote:


On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote:



James Bottomley wrote:



On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:




James Bottomley wrote:



So, James, what is your opinion on the above? Or the overall SCSI target 
project simplicity doesn't matter much for you and you think it's fine 
to duplicate Linux page cache in the user space to keep the in-kernel 
part of the project as small as possible?



The answers were pretty much contained here

http://marc.info/?l=linux-scsi&m=120164008302435

and here:

http://marc.info/?l=linux-scsi&m=120171067107293

Weren't they?


No, sorry, it doesn't look so for me. They are about performance, but 
I'm asking about the overall project's architecture, namely about one 
part of it: simplicity. Particularly, what do you think about 
duplicating Linux page cache in the user space to have zero-copy cached 
I/O? Or can you suggest another architectural solution for that problem 
in the STGT's approach?



Isn't that an advantage of a user space solution?  It simply uses the
backing store of whatever device supplies the data.  That means it takes
advantage of the existing mechanisms for caching.


No, please reread this thread, especially this message: 
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of 
the advantages of the kernel space implementation. The user space 
implementation has to have data copied between the cache and user space 
buffer, but the kernel space one can use pages in the cache directly, 
without extra copy.



Well, you've said it thrice (the bellman cried) but that doesn't make it
true.

The way a user space solution should work is to schedule mmapped I/O



from the backing store and then send this mmapped region off for target



I/O.  For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, You actually have to do a msync on the region to get the
data secured to the backing store. 


James, have you checked how fast is mmaped I/O if work size > size of 
RAM? It's several times slower comparing to buffered I/O. It was many 
times discussed in LKML and, seems, VM people consider it unavoidable. 



Erm, but if you're using the case of work size > size of RAM, you'll
find buffered I/O won't help because you don't have the memory for
buffers either.


James, just check and you will see, buffered I/O is a lot faster.


So in an out of memory situation the buffers you don't have are a lot
faster than the pages I don't have?


There isn't OOM in both cases. Just pages reclamation/readahead work 
much better in the buffered case.


So, using mmaped IO isn't an option for high performance. Plus, mmaped 
IO isn't an option for high reliability requirements, since it doesn't 
provide a practical way to handle I/O errors.


I think you'll find it does ... the page gather returns -EFAULT if
there's an I/O error in the gathered region. 


Err, to whom return? If you try to read from a mmaped page, which can't 
be populated due to I/O error, you will get SIGBUS or SIGSEGV, I don't 
remember exactly. It's quite tricky to get back to the faulted command 
from the signal handler.


Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you 
think that such mapping/unmapping is good for performance?




msync does something
similar if there's a write failure.



You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.


Can you be more exact and specify what kind of tricks should be done for 
that?


Actually, just avoid touching it seems to do the trick with a recent
kernel.


Hmm, how can one write to an mmaped page and don't touch it?


I meant from user space ... the writes are done inside the kernel.


Sure, the mmap() approach agreed to be unpractical, but could you 
elaborate more on this anyway, please? I'm just curious. Do you think 
about implementing a new syscall, which would put pages with data in the 
mmap'ed area?



However, as Linus has pointed out, this discussion is getting a bit off
topic. 


No, that isn't off topic. We've just proved that there is no good way to 
implement zero-copy cached I/O for STGT. I see the only practical way 
for that, proposed by FUJITA Tomonori some time ago: duplicating Linux 
page cache in the user space. But will you like it?



There's no actual evidence that copy problems are causing any
performatince issues issues for STGT.  In fact, there's evidence that
they're not for everything except IB networks.


The zero-copy cached I/O has not yet been implemented in SCST, I simply 
so far have not had time for that. Currently SCST performs better STGT, 
because of simpler processing path and less context switches per 
command. Memcpy() speed on modern systems is about t

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread James Bottomley

This email somehow didn't manage to make it to the list (I suspect
because it had html attachments).

James

---

  From: 
Julian Satran
<[EMAIL PROTECTED]>
To: 
Nicholas A. Bellinger
<[EMAIL PROTECTED]>
Cc: 
Andrew Morton
<[EMAIL PROTECTED]>, Alan
Cox <[EMAIL PROTECTED]>, Bart
Van Assche
<[EMAIL PROTECTED]>, FUJITA
Tomonori
<[EMAIL PROTECTED]>,
James Bottomley
<[EMAIL PROTECTED]>, ...
               Subject: 
Re: Integration of SCST in the
mainstream Linux kernel
  Date: 
Mon, 4 Feb 2008 21:31:48 -0500
(20:31 CST)


Well stated. In fact the "layers" above ethernet do provide the services 
that make the TCP/IP stack compelling - a whole complement of services.
ALL services required (naming, addressing, discovery, security etc.) will 
have to be recreated if you take the FcOE route. That makes good business 
for some but not necessary for the users. Those services BTW are not on 
the data path and are not "overhead".
The TCP/IP stack pathlength is decently low. What makes most 
implementations poor is that they where naively extended in the SMP world. 
Recent implementations (published) from IBM and Intel show excellent 
performance (4-6 times the regular stack). I do not have unfortunately 
latency numbers (as the community major stress has been throughput) but I 
assume that RDMA (not necessarily hardware RDMA) and/or the use of 
infiniband or latency critical applications - within clusters may be the 
ultimate low latency solution. Ethernet has some inherent latency issues 
(the bridges) that are inherited by anything on ethernet (FcOE included). 
The IP protocol stack is not inherently slow but some implementations are 
somewhat sluggish.
But instead of replacing them with new and half backed contraptions we 
would be all better of improving what we have and understand.

In the whole debate of around FcOE I heard a single argument that may have 
some merit - building convertors iSCSI-FCP to support legacy islands of 
FCP (read storage products that do not support iSCSI natively) is 
expensive. It is correct technically - only that FcOE eliminates an 
expense at the wrong end of the wire - it reduces the cost of the storage 
box at the expense of added cost at the server (and usually there a many 
servers using a storage box). FcOE vendors are also bound to provide FCP 
like services for FcOE - naming, security, discovery etc. - that do not 
exist on Ethernet. It is a good business for FcOE vendors - a duplicate 
set of solution for users.

It should be apparent by now that if one speaks about a "converged" 
network we should speak about an IP network and not about Ethernet.
If we take this route we might get perhaps also to an "infrastructure 
physical variants" that support very low latency better than ethernet and 
we might be able to use them with the same "stack" - a definite forward 
looking solution.

IMHO it is foolish to insist on throwing away the whole stack whenever we 
make a slight improvement in the physical layer of the network. We have a 
substantial investment and body of knowledge in the protocol stack and 
nothing proposed improves on it - obviously not as in its total level of 
service nor in performance.

Julo

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Linus Torvalds

On Tue, 5 Feb 2008, Bart Van Assche wrote:
> 
> Results that I did not expect:
> * A block transfer size of 1 MB is not enough to measure the maximal
> throughput. The maximal throughput is only reached at much higher
> block sizes (about 10 MB for SCST + SRP and about 100 MB for STGT +
> iSER).

Block transfer sizes over about 64kB are totally irrelevant for 99% of all 
people.

Don't even bother testing anything more. Yes, bigger transfers happen, but 
a lot of common loads have *smaller* transfers than 64kB.

So benchmarks that try to find "theoretical throughput" by just making big 
transfers should just be banned. They give numbers, yes, but the numbers 
are pointless.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Jeff Garzik


Olivier Galibert wrote:

On Mon, Feb 04, 2008 at 05:57:47PM -0500, Jeff Garzik wrote:

iSCSI and NBD were passe ideas at birth.  :)

Networked block devices are attractive because the concepts and 
implementation are more simple than networked filesystems... but usually 
you want to run some sort of filesystem on top.  At that point you might 
as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your 
networked block device (and associated complexity).


Call me a sysadmin, but I find easier to plug in and keep in place an
ethernet cable than these parallel scsi cables from hell.  Every
server has at least two ethernet ports by default, with rarely any
surprises at the kernel level.  Adding ethernet cards is inexpensive,
and you pretty much never hear of compatibility problems between
cards.

So ethernet as a connection medium is really nice compared to scsi.
Too bad iscsi is demented and ATAoE/NBD inexistant.  Maybe external
SAS will be nice, but I don't see it getting to the level of
universality of ethernet any time soon.  And it won't get the same
amount of user-level compatibility testing in any case.


Indeed, at the end of the day iSCSI is a bloated cabling standard.  :)

It has its uses, but I don't see it as ever coming close to replacing 
direct-to-network (perhaps backed with local cachefs) filesystems... 
which is how all the hype comes across to me.


Cheap "Lintel" boxes everybody is familiar with _are_ the storage 
appliances.  Until mass-produced ATA and SCSI devices start shipping 
with ethernet connectors, anyway.


Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Jeff Garzik


Bart Van Assche wrote:

On Feb 4, 2008 11:57 PM, Jeff Garzik <[EMAIL PROTECTED]> wrote:


Networked block devices are attractive because the concepts and
implementation are more simple than networked filesystems... but usually
you want to run some sort of filesystem on top.  At that point you might
as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your
networked block device (and associated complexity).


Running a filesystem on top of iSCSI results in better performance
than NFS, especially if the NFS client conforms to the NFS standard
(=synchronous writes).
By searching the web search for the keywords NFS, iSCSI and
performance I found the following (6 years old) document:
http://www.technomagesinc.com/papers/ip_paper.html. A quote from the
conclusion:
Our results, generated by running some of industry standard benchmarks,
show that iSCSI significantly outperforms NFS for situations when
performing streaming, database like accesses and small file transactions.


async performs better than sync...  this is news?  Furthermore, NFSv4 
has not only async capability but delegation too (and RDMA if you like 
such things), so the comparison is not relevant to modern times.


But a networked filesystem (note I'm using that term, not "NFS", from 
here on) is simply far more useful to the average user.  A networked 
block device is a building block -- and a useful one.  A networked 
filesystem is an immediately usable solution.


For remotely accessing data, iSCSI+fs is quite simply more overhead than 
a networked fs.  With iSCSI you are doing


local VFS -> local blkdev -> network

whereas a networked filesystem is

local VFS -> network

iSCSI+fs also adds new manageability issues, because unless the 
filesystem is single-computer (such as diskless iSCSI root fs), you 
still need to go across the network _once again_ to handle filesystem 
locking and coordination issues.


There is no _fundamental_ reason why remote shared storage via iSCSI OSD 
 is any faster than a networked filesystem.



SCSI-over-IP has its uses.  Absolutely.  It needed to be standardized. 
But let's not pretend iSCSI is anything more than what it is.  Its a 
bloated cat5 cabling standard :)


Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Matteo Tescione

On 5-02-2008 14:38, "FUJITA Tomonori" <[EMAIL PROTECTED]> wrote:

> On Tue, 05 Feb 2008 08:14:01 +0100
> Tomasz Chmielewski <[EMAIL PROTECTED]> wrote:
> 
>> James Bottomley schrieb:
>> 
>>> These are both features being independently worked on, are they not?
>>> Even if they weren't, the combination of the size of SCST in kernel plus
>>> the problem of having to find a migration path for the current STGT
>>> users still looks to me to involve the greater amount of work.
>> 
>> I don't want to be mean, but does anyone actually use STGT in
>> production? Seriously?
>> 
>> In the latest development version of STGT, it's only possible to stop
>> the tgtd target daemon using KILL / 9 signal - which also means all
>> iSCSI initiator connections are corrupted when tgtd target daemon is
>> started again (kernel upgrade, target daemon upgrade, server reboot etc.).
> 
> I don't know what "iSCSI initiator connections are corrupted"
> mean. But if you reboot a server, how can an iSCSI target
> implementation keep iSCSI tcp connections?
> 
> 
>> Imagine you have to reboot all your NFS clients when you reboot your NFS
>> server. Not only that - your data is probably corrupted, or at least the
>> filesystem deserves checking...

Don't know if matters, but in my setup (iscsi on top of drbd+heartbeat)
rebooting the primary server doesn't affect my iscsi traffic, SCST correctly
manages stop/crash, by sending unit attention to clients on reconnect.
Drbd+heartbeat correctly manages those things too.
Still from an end-user POV, i was able to reboot/survive a crash only with
SCST, IETD still has reconnect problems and STGT are even worst.

Regards,
--matteo


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Erez Zilber

Bart Van Assche wrote:
> As you probably know there is a trend in enterprise computing towards
> networked storage. This is illustrated by the emergence during the
> past few years of standards like SRP (SCSI RDMA Protocol), iSCSI
> (Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different
> pieces of software are necessary to make networked storage possible:
> initiator software and target software. As far as I know there exist
> three different SCSI target implementations for Linux:
> - The iSCSI Enterprise Target Daemon (IETD,
> http://iscsitarget.sourceforge.net/);
> - The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
> - The Generic SCSI Target Middle Level for Linux project (SCST,
> http://scst.sourceforge.net/).
> Since I was wondering which SCSI target software would be best suited
> for an InfiniBand network, I started evaluating the STGT and SCST SCSI
> target implementations. Apparently the performance difference between
> STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks,
> but the SCST target software outperforms the STGT software on an
> InfiniBand network. See also the following thread for the details:
> http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel.
>
>   
Sorry for the late response (but better late than never).

One may claim that STGT should have lower performance than SCST because
its data path is from userspace. However, your results show that for
non-IB transports, they both show the same numbers. Furthermore, with IB
there shouldn't be any additional difference between the 2 targets
because data transfer from userspace is as efficient as data transfer
from kernel space.

The only explanation that I see is that fine tuning for iSCSI & iSER is
required. As was already mentioned in this thread, with SDR you can get
~900 MB/sec with iSER (on STGT).

Erez
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Erez Zilber

Bart Van Assche wrote:
> On Jan 30, 2008 12:32 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
>   
>> iSER has parameters to limit the maximum size of RDMA (it needs to
>> repeat RDMA with a poor configuration)?
>> 
>
> Please specify which parameters you are referring to. As you know I
> had already repeated my tests with ridiculously high values for the
> following iSER parameters: FirstBurstLength, MaxBurstLength and
> MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block
> size specified to dd).
>
>   
Using such large values for FirstBurstLength will give you poor
performance numbers for WRITE commands (with iSER). FirstBurstLength
means how much data should you send as unsolicited data (i.e. without
RDMA). It means that your WRITE commands were sent without RDMA.

Erez
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

On Tue, 05 Feb 2008 17:07:07 +0100
Tomasz Chmielewski <[EMAIL PROTECTED]> wrote:

> FUJITA Tomonori schrieb:
> > On Tue, 05 Feb 2008 08:14:01 +0100
> > Tomasz Chmielewski <[EMAIL PROTECTED]> wrote:
> > 
> >> James Bottomley schrieb:
> >>
> >>> These are both features being independently worked on, are they not?
> >>> Even if they weren't, the combination of the size of SCST in kernel plus
> >>> the problem of having to find a migration path for the current STGT
> >>> users still looks to me to involve the greater amount of work.
> >> I don't want to be mean, but does anyone actually use STGT in
> >> production? Seriously?
> >>
> >> In the latest development version of STGT, it's only possible to stop
> >> the tgtd target daemon using KILL / 9 signal - which also means all
> >> iSCSI initiator connections are corrupted when tgtd target daemon is
> >> started again (kernel upgrade, target daemon upgrade, server reboot etc.).
> > 
> > I don't know what "iSCSI initiator connections are corrupted"
> > mean. But if you reboot a server, how can an iSCSI target
> > implementation keep iSCSI tcp connections?
> 
> The problem with tgtd is that you can't start it (configured) in an
> "atomic" way.
> Usually, one will start tgtd and it's configuration in a script (I 
> replaced some parameters with "..." to make it shorter and more readable):

Thanks for the details. So the way to stop the daemon is not related
with your problem.

It's easily fixable. Can you start a new thread about this on
stgt-devel mailing list? When we agree on the interface to start the
daemon, I'll implement it.

> tgtd
> tgtadm --op new ...
> tgtadm --lld iscsi --op new ...

(snip)

> So the only way to start/restart tgtd reliably is to do hacks which are 
> needed with yet another iSCSI kernel implementation (IET): use iptables.
> 
> iptables 
> tgtd
> sleep 1
> tgtadm --op new ...
> tgtadm --lld iscsi --op new ...
> iptables 
> 
> 
> A bit ugly, isn't it?
> Having to tinker with a firewall in order to start a daemon is by no 
> means a sign of a well-tested and mature project.
> 
> That's why I asked how many people use stgt in a production environment 
> - James was worried about a potential migration path for current users.

I don't know how many people use stgt in a production environment but
I'm not sure that this problem prevents many people from using it in a
production environment.

You want to reboot a server running target devices while initiators
connect to it. Rebooting the target server behind the initiators
seldom works. System adminstorators in my workplace reboot storage
devices once a year and tell us to shut down the initiator machines
that use them before that.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Bart Van Assche

Regarding the performance tests I promised to perform: although until
now I only have been able to run two tests (STGT + iSER versus SCST +
SRP), the results are interesting. I will run the remaining test cases
during the next days.

About the test setup: dd and xdd were used to transfer 2 GB of data
between an initiator system and a target system via direct I/O over an
SDR InfiniBand network (1GB/s). The block size varied between 512
bytes and 1 GB, but was always a power of two.

Expected results:
* The measurement results are consistent with the numbers I published earlier.
* During data transfers all data is transferred in blocks between 4 KB
and 32 KB in size (according to the SCST statistics).
* For small and medium block sizes (<= 32 KB) transfer times can be
modeled very well by the following formula: (transfer time) = (setup
latency) + (bytes transferred)/(bandwidth). The correlation numbers
are very close to one.
* The latency and bandwidth parameters depend on the test tool (dd
versus xdd), on the kind of test performed (reading versus writing),
on the SCSI target and on the communication protocol.
* When using RDMA (iSER or SRP), SCST has a lower latency and higher
bandwidth than STGT (results from linear regression for block sizes <=
32 KB):
   Test  Latency(us) Bandwidth (MB/s) Correlation
   STGT+iSER, read, dd   64  560  0.95
   STGT+iSER, read, xdd  65  556  0.94
   STGT+iSER, write, dd  53  394  0.71
   STGT+iSER, write, xdd 54  445  0.59
   SCST+SRP, read, dd39  657  0.83
   SCST+SRP, read, xdd   41  668  0.87
   SCST+SRP, write, dd   52  449  0.62
   SCST+SRP, write, xdd  52  516  0.77

Results that I did not expect:
* A block transfer size of 1 MB is not enough to measure the maximal
throughput. The maximal throughput is only reached at much higher
block sizes (about 10 MB for SCST + SRP and about 100 MB for STGT +
iSER).
* There is one case where dd and xdd results are inconsistent: when
reading via SCST + SRP and for block sizes of about 1 MB.
* For block sizes > 64 KB the measurements differ from the model. This
is probably because all initiator-target transfers happen in blocks of
32 KB or less.

For the details and some graphs, see also
http://software.qlayer.com/display/iSCSI/Measurements .

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Ming Zhang

On Tue, 2008-02-05 at 17:07 +0100, Tomasz Chmielewski wrote:
> FUJITA Tomonori schrieb:
> > On Tue, 05 Feb 2008 08:14:01 +0100
> > Tomasz Chmielewski <[EMAIL PROTECTED]> wrote:
> > 
> >> James Bottomley schrieb:
> >>
> >>> These are both features being independently worked on, are they not?
> >>> Even if they weren't, the combination of the size of SCST in kernel plus
> >>> the problem of having to find a migration path for the current STGT
> >>> users still looks to me to involve the greater amount of work.
> >> I don't want to be mean, but does anyone actually use STGT in
> >> production? Seriously?
> >>
> >> In the latest development version of STGT, it's only possible to stop
> >> the tgtd target daemon using KILL / 9 signal - which also means all
> >> iSCSI initiator connections are corrupted when tgtd target daemon is
> >> started again (kernel upgrade, target daemon upgrade, server reboot etc.).
> > 
> > I don't know what "iSCSI initiator connections are corrupted"
> > mean. But if you reboot a server, how can an iSCSI target
> > implementation keep iSCSI tcp connections?
> 
> The problem with tgtd is that you can't start it (configured) in an
> "atomic" way.
> Usually, one will start tgtd and it's configuration in a script (I 
> replaced some parameters with "..." to make it shorter and more readable):
> 
> 
> tgtd
> tgtadm --op new ...
> tgtadm --lld iscsi --op new ...
> 
> 
> However, this won't work - tgtd goes immediately in the background as it 
> is still starting, and the first tgtadm commands will fail:

this should be a easy fix. start tgtd, get port setup ready in forked
process, then signal its parent that ready to quit. or set port ready in
parent, fork and pass to daemon.


> 
> # bash -x tgtd-start
> + tgtd
> + tgtadm --op new --mode target ...
> tgtadm: can't connect to the tgt daemon, Connection refused
> tgtadm: can't send the request to the tgt daemon, Transport endpoint is 
> not connected
> + tgtadm --lld iscsi --op new --mode account ...
> tgtadm: can't connect to the tgt daemon, Connection refused
> tgtadm: can't send the request to the tgt daemon, Transport endpoint is 
> not connected
> + tgtadm --lld iscsi --op bind --mode account --tid 1 ...
> tgtadm: can't find the target
> + tgtadm --op new --mode logicalunit --tid 1 --lun 1 ...
> tgtadm: can't find the target
> + tgtadm --op bind --mode target --tid 1 -I ALL
> tgtadm: can't find the target
> + tgtadm --op new --mode target --tid 2 ...
> + tgtadm --op new --mode logicalunit --tid 2 --lun 1 ...
> + tgtadm --op bind --mode target --tid 2 -I ALL
> 
> 
> OK, if tgtd takes longer to start, perhaps it's a good idea to sleep a 
> second right after tgtd?
> 
> tgtd
> sleep 1
> tgtadm --op new ...
> tgtadm --lld iscsi --op new ...
> 
> 
> No, it is not a good idea - if tgtd listens on port 3260 *and* is 
> unconfigured yet,  any reconnecting initiator will fail, like below:

this is another easy fix. tgtd started with unconfigured status and then
a tgtadm can configure it and turn it into ready status.


those are really minor usability issue. ( i know it is painful for user,
i agree)


the major problem here is to discuss in architectural wise, which one is
better... linux kernel should have one implementation that is good from
foundation...





> 
> end_request: I/O error, dev sdb, sector 7045192
> Buffer I/O error on device sdb, logical block 880649
> lost page write due to I/O error on sdb
> Aborting journal on device sdb.
> ext3_abort called.
> EXT3-fs error (device sdb): ext3_journal_start_sb: Detected aborted journal
> Remounting filesystem read-only
> end_request: I/O error, dev sdb, sector 7045880
> Buffer I/O error on device sdb, logical block 880735
> lost page write due to I/O error on sdb
> end_request: I/O error, dev sdb, sector 6728
> Buffer I/O error on device sdb, logical block 841
> lost page write due to I/O error on sdb
> end_request: I/O error, dev sdb, sector 7045192
> Buffer I/O error on device sdb, logical block 880649
> lost page write due to I/O error on sdb
> end_request: I/O error, dev sdb, sector 7045880
> Buffer I/O error on device sdb, logical block 880735
> lost page write due to I/O error on sdb
> __journal_remove_journal_head: freeing b_frozen_data
> __journal_remove_journal_head: freeing b_frozen_data
> 
> 
> Ouch.
> 
> So the only way to start/restart tgtd reliably is to do hacks which are 
> needed with yet another iSCSI kernel implementation (IET): use iptables.
> 
> iptables 
> tgtd
> sleep 1
> tgtadm --op new ...
> tgtadm --lld iscsi --op new ...
> iptables 
> 
> 
> A bit ugly, isn't it?
> Having to tinker with a firewall in order to start a daemon is by no 
> means a sign of a well-tested and mature project.
> 
> That's why I asked how many people use stgt in a production environment 
> - James was worried about a potential migration path for current users.
> 
> 
> 
> -- 
> Tomasz Chmielewski
> http://wpkg.org
> 
> 
> ---

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Tomasz Chmielewski


FUJITA Tomonori schrieb:

On Tue, 05 Feb 2008 08:14:01 +0100
Tomasz Chmielewski <[EMAIL PROTECTED]> wrote:


James Bottomley schrieb:


These are both features being independently worked on, are they not?
Even if they weren't, the combination of the size of SCST in kernel plus
the problem of having to find a migration path for the current STGT
users still looks to me to involve the greater amount of work.

I don't want to be mean, but does anyone actually use STGT in
production? Seriously?

In the latest development version of STGT, it's only possible to stop
the tgtd target daemon using KILL / 9 signal - which also means all
iSCSI initiator connections are corrupted when tgtd target daemon is
started again (kernel upgrade, target daemon upgrade, server reboot etc.).


I don't know what "iSCSI initiator connections are corrupted"
mean. But if you reboot a server, how can an iSCSI target
implementation keep iSCSI tcp connections?


The problem with tgtd is that you can't start it (configured) in an
"atomic" way.
Usually, one will start tgtd and it's configuration in a script (I 
replaced some parameters with "..." to make it shorter and more readable):



tgtd
tgtadm --op new ...
tgtadm --lld iscsi --op new ...


However, this won't work - tgtd goes immediately in the background as it 
is still starting, and the first tgtadm commands will fail:


# bash -x tgtd-start
+ tgtd
+ tgtadm --op new --mode target ...
tgtadm: can't connect to the tgt daemon, Connection refused
tgtadm: can't send the request to the tgt daemon, Transport endpoint is 
not connected

+ tgtadm --lld iscsi --op new --mode account ...
tgtadm: can't connect to the tgt daemon, Connection refused
tgtadm: can't send the request to the tgt daemon, Transport endpoint is 
not connected

+ tgtadm --lld iscsi --op bind --mode account --tid 1 ...
tgtadm: can't find the target
+ tgtadm --op new --mode logicalunit --tid 1 --lun 1 ...
tgtadm: can't find the target
+ tgtadm --op bind --mode target --tid 1 -I ALL
tgtadm: can't find the target
+ tgtadm --op new --mode target --tid 2 ...
+ tgtadm --op new --mode logicalunit --tid 2 --lun 1 ...
+ tgtadm --op bind --mode target --tid 2 -I ALL


OK, if tgtd takes longer to start, perhaps it's a good idea to sleep a 
second right after tgtd?


tgtd
sleep 1
tgtadm --op new ...
tgtadm --lld iscsi --op new ...


No, it is not a good idea - if tgtd listens on port 3260 *and* is 
unconfigured yet,  any reconnecting initiator will fail, like below:


end_request: I/O error, dev sdb, sector 7045192
Buffer I/O error on device sdb, logical block 880649
lost page write due to I/O error on sdb
Aborting journal on device sdb.
ext3_abort called.
EXT3-fs error (device sdb): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
end_request: I/O error, dev sdb, sector 7045880
Buffer I/O error on device sdb, logical block 880735
lost page write due to I/O error on sdb
end_request: I/O error, dev sdb, sector 6728
Buffer I/O error on device sdb, logical block 841
lost page write due to I/O error on sdb
end_request: I/O error, dev sdb, sector 7045192
Buffer I/O error on device sdb, logical block 880649
lost page write due to I/O error on sdb
end_request: I/O error, dev sdb, sector 7045880
Buffer I/O error on device sdb, logical block 880735
lost page write due to I/O error on sdb
__journal_remove_journal_head: freeing b_frozen_data
__journal_remove_journal_head: freeing b_frozen_data


Ouch.

So the only way to start/restart tgtd reliably is to do hacks which are 
needed with yet another iSCSI kernel implementation (IET): use iptables.


iptables 
tgtd
sleep 1
tgtadm --op new ...
tgtadm --lld iscsi --op new ...
iptables 


A bit ugly, isn't it?
Having to tinker with a firewall in order to start a daemon is by no 
means a sign of a well-tested and mature project.


That's why I asked how many people use stgt in a production environment 
- James was worried about a potential migration path for current users.




--
Tomasz Chmielewski
http://wpkg.org

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

On Mon, 4 Feb 2008 20:07:01 -0600
"Chris Weiss" <[EMAIL PROTECTED]> wrote:

> On Feb 4, 2008 11:30 AM, Douglas Gilbert <[EMAIL PROTECTED]> wrote:
> > Alan Cox wrote:
> > >> better. So for example, I personally suspect that ATA-over-ethernet is 
> > >> way
> > >> better than some crazy SCSI-over-TCP crap, but I'm biased for simple and
> > >> low-level, and against those crazy SCSI people to begin with.
> > >
> > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
> > > would probably trash iSCSI for latency if nothing else.
> >
> > And a variant that doesn't do ATA or IP:
> > http://www.fcoe.com/
> >
> 
> however, and interestingly enough, the open-fcoe software target
> depends on scst (for now anyway)

STGT also supports software FCoE target driver though it's still
experimental stuff.

http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg12705.html

It works in user space like STGT's iSCSI (and iSER) target driver
(i.e. no kernel/user space interaction).
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

On Tue, 05 Feb 2008 05:43:10 +0100
Matteo Tescione <[EMAIL PROTECTED]> wrote:

> Hi all,
> And sorry for intrusion, i am not a developer but i work everyday with iscsi
> and i found it fantastic.
> Altough Aoe, Fcoe and so on could be better, we have to look in real world
> implementations what is needed *now*, and if we look at vmware world,
> virtual iron, microsoft clustering etc, the answer is iSCSI.
> And now, SCST is the best open-source iSCSI target. So, from an end-user
> point of view, what are the really problems to not integrate scst in the
> mainstream kernel?

Currently, the best open-source iSCSI target implemenation in Linux is
Nicholas's LIO, I guess.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

On Tue, 05 Feb 2008 08:14:01 +0100
Tomasz Chmielewski <[EMAIL PROTECTED]> wrote:

> James Bottomley schrieb:
> 
> > These are both features being independently worked on, are they not?
> > Even if they weren't, the combination of the size of SCST in kernel plus
> > the problem of having to find a migration path for the current STGT
> > users still looks to me to involve the greater amount of work.
> 
> I don't want to be mean, but does anyone actually use STGT in
> production? Seriously?
> 
> In the latest development version of STGT, it's only possible to stop
> the tgtd target daemon using KILL / 9 signal - which also means all
> iSCSI initiator connections are corrupted when tgtd target daemon is
> started again (kernel upgrade, target daemon upgrade, server reboot etc.).

I don't know what "iSCSI initiator connections are corrupted"
mean. But if you reboot a server, how can an iSCSI target
implementation keep iSCSI tcp connections?


> Imagine you have to reboot all your NFS clients when you reboot your NFS
> server. Not only that - your data is probably corrupted, or at least the
> filesystem deserves checking...
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Olivier Galibert

On Mon, Feb 04, 2008 at 05:57:47PM -0500, Jeff Garzik wrote:
> iSCSI and NBD were passe ideas at birth.  :)
> 
> Networked block devices are attractive because the concepts and 
> implementation are more simple than networked filesystems... but usually 
> you want to run some sort of filesystem on top.  At that point you might 
> as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your 
> networked block device (and associated complexity).

Call me a sysadmin, but I find easier to plug in and keep in place an
ethernet cable than these parallel scsi cables from hell.  Every
server has at least two ethernet ports by default, with rarely any
surprises at the kernel level.  Adding ethernet cards is inexpensive,
and you pretty much never hear of compatibility problems between
cards.

So ethernet as a connection medium is really nice compared to scsi.
Too bad iscsi is demented and ATAoE/NBD inexistant.  Maybe external
SAS will be nice, but I don't see it getting to the level of
universality of ethernet any time soon.  And it won't get the same
amount of user-level compatibility testing in any case.

  OG.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Bart Van Assche

On Feb 4, 2008 11:57 PM, Jeff Garzik <[EMAIL PROTECTED]> wrote:

> Networked block devices are attractive because the concepts and
> implementation are more simple than networked filesystems... but usually
> you want to run some sort of filesystem on top.  At that point you might
> as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your
> networked block device (and associated complexity).

Running a filesystem on top of iSCSI results in better performance
than NFS, especially if the NFS client conforms to the NFS standard
(=synchronous writes).
By searching the web search for the keywords NFS, iSCSI and
performance I found the following (6 years old) document:
http://www.technomagesinc.com/papers/ip_paper.html. A quote from the
conclusion:
Our results, generated by running some of industry standard benchmarks,
show that iSCSI significantly outperforms NFS for situations when
performing streaming, database like accesses and small file transactions.

Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Tomasz Chmielewski


James Bottomley schrieb:


These are both features being independently worked on, are they not?
Even if they weren't, the combination of the size of SCST in kernel plus
the problem of having to find a migration path for the current STGT
users still looks to me to involve the greater amount of work.


I don't want to be mean, but does anyone actually use STGT in
production? Seriously?

In the latest development version of STGT, it's only possible to stop
the tgtd target daemon using KILL / 9 signal - which also means all
iSCSI initiator connections are corrupted when tgtd target daemon is
started again (kernel upgrade, target daemon upgrade, server reboot etc.).

Imagine you have to reboot all your NFS clients when you reboot your NFS
server. Not only that - your data is probably corrupted, or at least the
filesystem deserves checking...


--
Tomasz Chmielewski
http://wpkg.org



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel


On Tue, 2008-02-05 at 05:43 +0100, Matteo Tescione wrote:
> Hi all,
> And sorry for intrusion, i am not a developer but i work everyday with iscsi
> and i found it fantastic.
> Altough Aoe, Fcoe and so on could be better, we have to look in real world
> implementations what is needed *now*, and if we look at vmware world,
> virtual iron, microsoft clustering etc, the answer is iSCSI.
> And now, SCST is the best open-source iSCSI target. So, from an end-user
> point of view, what are the really problems to not integrate scst in the
> mainstream kernel?

The fact that your last statement is conjecture.  It's definitely untrue
for non-IB networks, and the jury is still out on IB networks.

James


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Matteo Tescione

Hi all,
And sorry for intrusion, i am not a developer but i work everyday with iscsi
and i found it fantastic.
Altough Aoe, Fcoe and so on could be better, we have to look in real world
implementations what is needed *now*, and if we look at vmware world,
virtual iron, microsoft clustering etc, the answer is iSCSI.
And now, SCST is the best open-source iSCSI target. So, from an end-user
point of view, what are the really problems to not integrate scst in the
mainstream kernel?

Just my two cent,
--
So long and thank for all the fish
--
#Matteo Tescione
#RMnet srl


> 
> 
> On Mon, 4 Feb 2008, Matt Mackall wrote:
>> 
>> But ATAoE is boring because it's not IP. Which means no routing,
>> firewalls, tunnels, congestion control, etc.
> 
> The thing is, that's often an advantage. Not just for performance.
> 
>> NBD and iSCSI (for all its hideous growths) can take advantage of these
>> things.
> 
> .. and all this could equally well be done by a simple bridging protocol
> (completely independently of any AoE code).
> 
> The thing is, iSCSI does things at the wrong level. It *forces* people to
> use the complex protocols, when it's a known that a lot of people don't
> want it. 
> 
> Which is why these AoE and FCoE things keep popping up.
> 
> It's easy to bridge ethernet and add a new layer on top of AoE if you need
> it. In comparison, it's *impossible* to remove an unnecessary layer from
> iSCSI.
> 
> This is why "simple and low-level is good". It's always possible to build
> on top of low-level protocols, while it's generally never possible to
> simplify overly complex ones.
> 
> Linus
> 
> -
> This SF.net email is sponsored by: Microsoft
> Defy all challenges. Microsoft(R) Visual Studio 2008.
> http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
> ___
> Scst-devel mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/scst-devel
> 


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Chris Weiss

On Feb 4, 2008 11:30 AM, Douglas Gilbert <[EMAIL PROTECTED]> wrote:
> Alan Cox wrote:
> >> better. So for example, I personally suspect that ATA-over-ethernet is way
> >> better than some crazy SCSI-over-TCP crap, but I'm biased for simple and
> >> low-level, and against those crazy SCSI people to begin with.
> >
> > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
> > would probably trash iSCSI for latency if nothing else.
>
> And a variant that doesn't do ATA or IP:
> http://www.fcoe.com/
>

however, and interestingly enough, the open-fcoe software target
depends on scst (for now anyway)
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 4 Feb 2008, Jeff Garzik wrote:
> 
> Both of these are easily handled if the server is 100% in charge of managing
> the filesystem _metadata_ and data.  That's what I meant by complete control.
> 
> i.e. it not ext3 or reiserfs or vfat, its a block device or 1000GB file
> managed by a userland process.

Oh ok.

Yes, if you bring the filesystem into user mode too, then the problems go 
away - because now your NFSD can interact directly with the filesystem 
without any kernel/usermode abstraction layer rules in between. So that 
has all the same properties as moving NFSD entirely into the kernel.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Matt Mackall

On Mon, 2008-02-04 at 16:24 -0800, Linus Torvalds wrote:
> 
> On Mon, 4 Feb 2008, Matt Mackall wrote:
> > 
> > But ATAoE is boring because it's not IP. Which means no routing,
> > firewalls, tunnels, congestion control, etc.
> 
> The thing is, that's often an advantage. Not just for performance.
> 
> > NBD and iSCSI (for all its hideous growths) can take advantage of these
> > things.
> 
> .. and all this could equally well be done by a simple bridging protocol 
> (completely independently of any AoE code).
> 
> The thing is, iSCSI does things at the wrong level. It *forces* people to 
> use the complex protocols, when it's a known that a lot of people don't 
> want it. 

I frankly think NBD is at a pretty comfortable level. It's internally
very simple (and hardware-agnostic). And moderately easy to do in
silicon.

But I'm not going to defend iSCSI. I worked on the first implementation
(what became the Cisco iSCSI driver) and I have no love for iSCSI at
all. It should have been (and started out as) a nearly trivial
encapsulation of SCSI over TCP much like ATA over Ethernet but quickly
lost the plot when committees got ahold of it.

-- 
Mathematics is the supreme nostalgia of our time.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Linus Torvalds wrote:

On Mon, 4 Feb 2008, Matt Mackall wrote:

But ATAoE is boring because it's not IP. Which means no routing,
firewalls, tunnels, congestion control, etc.


The thing is, that's often an advantage. Not just for performance.


NBD and iSCSI (for all its hideous growths) can take advantage of these
things.


.. and all this could equally well be done by a simple bridging protocol 
(completely independently of any AoE code).


The thing is, iSCSI does things at the wrong level. It *forces* people to 
use the complex protocols, when it's a known that a lot of people don't 
want it. 

Which is why these AoE and FCoE things keep popping up. 

It's easy to bridge ethernet and add a new layer on top of AoE if you need 
it. In comparison, it's *impossible* to remove an unnecessary layer from 
iSCSI.


This is why "simple and low-level is good". It's always possible to build 
on top of low-level protocols, while it's generally never possible to 
simplify overly complex ones.


Never discount "easy" and "just works", which is what IP (and TCP) gives 
you...


Sure you can use a bridging protocol and all that jazz, but I wager, to 
a network admin yet-another-IP-application is easier to evaluate, deploy 
and manage on existing networks.


Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 4 Feb 2008, Matt Mackall wrote:
> 
> But ATAoE is boring because it's not IP. Which means no routing,
> firewalls, tunnels, congestion control, etc.

The thing is, that's often an advantage. Not just for performance.

> NBD and iSCSI (for all its hideous growths) can take advantage of these
> things.

.. and all this could equally well be done by a simple bridging protocol 
(completely independently of any AoE code).

The thing is, iSCSI does things at the wrong level. It *forces* people to 
use the complex protocols, when it's a known that a lot of people don't 
want it. 

Which is why these AoE and FCoE things keep popping up. 

It's easy to bridge ethernet and add a new layer on top of AoE if you need 
it. In comparison, it's *impossible* to remove an unnecessary layer from 
iSCSI.

This is why "simple and low-level is good". It's always possible to build 
on top of low-level protocols, while it's generally never possible to 
simplify overly complex ones.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Matt Mackall


On Mon, 2008-02-04 at 22:43 +, Alan Cox wrote:
> > better. So for example, I personally suspect that ATA-over-ethernet is way 
> > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
> > low-level, and against those crazy SCSI people to begin with.
> 
> Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
> would probably trash iSCSI for latency if nothing else.

But ATAoE is boring because it's not IP. Which means no routing,
firewalls, tunnels, congestion control, etc.

NBD and iSCSI (for all its hideous growths) can take advantage of these
things.

-- 
Mathematics is the supreme nostalgia of our time.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Linus Torvalds wrote:


On Mon, 4 Feb 2008, Jeff Garzik wrote:

Well, speaking as a complete nutter who just finished the bare bones of an
NFSv4 userland server[1]...  it depends on your approach.


You definitely are a complete nutter ;)


If the userland server is the _only_ one accessing the data[2] -- i.e. the
database server model where ls(1) shows a couple multi-gigabyte files or a raw
partition -- then it's easy to get all the semantics right, including file
handles.  You're not racing with local kernel fileserving.


It's not really simple in general even then. The problems come with file 
handles, and two big issues in particular:


 - handling a reboot (of the server) without impacting the client really 
   does need a "look up by file handle" operation (which you can do by 
   logging the pathname to filehandle translation, but it certainly gets 
   problematic).


 - non-Unix-like filesystems don't necessarily have a stable "st_ino" 
   field (ie it may change over a rename or have no meaning what-so-ever, 
   things like that), and that makes trying to generate a filehandle 
   really interesting for them.


Both of these are easily handled if the server is 100% in charge of 
managing the filesystem _metadata_ and data.  That's what I meant by 
complete control.


i.e. it not ext3 or reiserfs or vfat, its a block device or 1000GB file 
managed by a userland process.


Doing it that way gives one a bit more freedom to tune the filesystem 
format directly.  Stable inode numbers and filehandles are just easy as 
they are with ext3.  I'm the filesystem format designer, after all. (run 
for your lives...)


You do wind up having to roll your own dcache in userspace, though.

A matter of taste in implementation, but it is not difficult...  I've 
certainly never been accused of having good taste :)



I do agree that it's possible - we obviously _did_ have a user-level NFSD 
for a long while, after all - but it's quite painful if you want to handle 
things well. Only allowing access through the NFSD certainly helps a lot, 
but still doesn't make it quite as trivial as you claim ;)


Nah, you're thinking about something different:  a userland NFSD 
competing with other userland processes for access to the same files, 
while the kernel ultimately manages the filesystem metadata.  Recipe for 
races and inequities, and it's good we moved away from that.


I'm talking about where a userland process manages the filesystem 
metadata too.  In a filesystem with a million files, ls(1) on the server 
will only show a single file:


[EMAIL PROTECTED] ~]$ ls -l /spare/fileserver-data/
total 70657116
-rw-r--r-- 1 jgarzik jgarzik 1818064825 2007-12-29 06:40 fsimage.1



Of course, I think you can make NFSv4 to use volatile filehandles instead 
of the traditional long-lived ones, and that really should avoid almost 
all of the problems with doing a NFSv4 server in user space. However, I'd 
expect there to be clients that don't do the whole volatile thing, or 
support the file handle becoming stale only at certain well-defined points 
(ie after renames, not at random reboot times).


Don't get me started on "volatile" versus "persistent" filehandles in 
NFSv4...  groan.


Jeff


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 4 Feb 2008, Jeff Garzik wrote:
> 
> Well, speaking as a complete nutter who just finished the bare bones of an
> NFSv4 userland server[1]...  it depends on your approach.

You definitely are a complete nutter ;)

> If the userland server is the _only_ one accessing the data[2] -- i.e. the
> database server model where ls(1) shows a couple multi-gigabyte files or a raw
> partition -- then it's easy to get all the semantics right, including file
> handles.  You're not racing with local kernel fileserving.

It's not really simple in general even then. The problems come with file 
handles, and two big issues in particular:

 - handling a reboot (of the server) without impacting the client really 
   does need a "look up by file handle" operation (which you can do by 
   logging the pathname to filehandle translation, but it certainly gets 
   problematic).

 - non-Unix-like filesystems don't necessarily have a stable "st_ino" 
   field (ie it may change over a rename or have no meaning what-so-ever, 
   things like that), and that makes trying to generate a filehandle 
   really interesting for them.

I do agree that it's possible - we obviously _did_ have a user-level NFSD 
for a long while, after all - but it's quite painful if you want to handle 
things well. Only allowing access through the NFSD certainly helps a lot, 
but still doesn't make it quite as trivial as you claim ;)

Of course, I think you can make NFSv4 to use volatile filehandles instead 
of the traditional long-lived ones, and that really should avoid almost 
all of the problems with doing a NFSv4 server in user space. However, I'd 
expect there to be clients that don't do the whole volatile thing, or 
support the file handle becoming stale only at certain well-defined points 
(ie after renames, not at random reboot times).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Douglas Gilbert


Alan Cox wrote:
better. So for example, I personally suspect that ATA-over-ethernet is way 
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
low-level, and against those crazy SCSI people to begin with.


Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
would probably trash iSCSI for latency if nothing else.


And a variant that doesn't do ATA or IP:
http://www.fcoe.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 4 Feb 2008, Jeff Garzik wrote:
> 
> For years I have been hoping that someone will invent a simple protocol (w/
> strong auth) that can transit ATA and SCSI commands and responses. Heck, it
> would be almost trivial if the kernel had a TLS/SSL implementation.

Why would you want authorization? If you don't use IP (just ethernet 
framing), then 99% of the time the solution is to just trust the subnet. 

So most people would never want TLS/SSL, and the ones that *do* want it 
would probably also want IP routing, so you'd actually be better off with 
a separate higher-level bridging protocol rather than have TLS/SSL as part 
of the actual packet protocol.

So don't add complexity. The beauty of ATA-over-ethernet is exactly that 
it's simple and straightforward.

(Simple and straightforward is also nice for actually creating devices 
that are the targets of this. I just *bet* that an iSCSI target device 
probably needs two orders of magnitude more CPU power than a simple AoE 
thing that can probably be done in an FPGA with no real software at all).

Whatever. We have now officially gotten totally off topic ;)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 15:12 -0800, Nicholas A. Bellinger wrote:
> On Mon, 2008-02-04 at 17:00 -0600, James Bottomley wrote:
> > On Mon, 2008-02-04 at 22:43 +, Alan Cox wrote:
> > > > better. So for example, I personally suspect that ATA-over-ethernet is 
> > > > way 
> > > > better than some crazy SCSI-over-TCP crap, but I'm biased for simple 
> > > > and 
> > > > low-level, and against those crazy SCSI people to begin with.
> > > 
> > > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
> > > would probably trash iSCSI for latency if nothing else.
> > 
> > Actually, there's also FCoE now ... which is essentially SCSI
> > encapsulated in Fibre Channel Protocols (FCP) running over ethernet with
> > Jumbo frames.  It does the standard SCSI TCQ, so should answer all the
> > latency pieces.  Intel even has an implementation:
> > 
> > http://www.open-fcoe.org/
> > 
> > I tend to prefer the low levels as well.  The whole disadvantage for IP
> > as regards iSCSI was the layers of protocols on top of it for
> > addressing, authenticating, encrypting and finding any iSCSI device
> > anywhere in the connected universe.
> 
> Btw, while simple in-band discovery of iSCSI exists, the standards based
> IP storage deployments (iSCSI and iFCP) use iSNS (RFC-4171) for
> discovery and network fabric management, for things like sending state
> change notifications when a particular network portal is going away so
> that the initiator can bring up a different communication patch to a
> different network portal, etc.
> 
> > 
> > I tend to see loss of routing from operating at the MAC level to be a
> > nicely justifiable tradeoff (most storage networks tend to be hubbed or
> > switched anyway).  Plus an ethernet MAC with jumbo frames is a large
> > framed nearly lossless medium, which is practically what FCP is
> > expecting.  If you really have to connect large remote sites ... well
> > that's what tunnelling bridges are for.
> > 
> 
> Some of the points by Julo on the IPS TWG iSCSI vs. FCoE thread:
> 
>   * the network is limited in physical span and logical span (number
> of switches)
>   * flow-control/congestion control is achieved with a mechanism
> adequate for a limited span network (credits). The packet loss
> rate is almost nil and that allows FCP to avoid using a
> transport (end-to-end) layer
>   * FCP she switches are simple (addresses are local and the memory
> requirements cam be limited through the credit mechanism)
>   * The credit mechanisms is highly unstable for large networks
> (check switch vendors planning docs for the network diameter
> limits) – the scaling argument
>   * Ethernet has no credit mechanism and any mechanism with a
> similar effect increases the end point cost. Building a
> transport layer in the protocol stack has always been the
> preferred choice of the networking community – the community
> argument
>   * The "performance penalty" of a complete protocol stack has
> always been overstated (and overrated). Advances in protocol
> stack implementation and finer tuning of the congestion control
> mechanisms make conventional TCP/IP performing well even at 10
> Gb/s and over. Moreover the multicore processors that become
> dominant on the computing scene have enough compute cycles
> available to make any "offloading" possible as a mere code
> restructuring exercise (see the stack reports from Intel, IBM
> etc.)
>   * Building on a complete stack makes available a wealth of
> operational and management mechanisms built over the years by
> the networking community (routing, provisioning, security,
> service location etc.) – the community argument
>   * Higher level storage access over an IP network is widely
> available and having both block and file served over the same
> connection with the same support and management structure is
> compelling– the community argument
>   * Highly efficient networks are easy to build over IP with optimal
> (shortest path) routing while Layer 2 networks use bridging and
> are limited by the logical tree structure that bridges must
> follow. The effort to combine routers and bridges (rbridges) is
> promising to change that but it will take some time to finalize
> (and we don't know exactly how it will operate). Untill then the
> scale of Layer 2 network is going to seriously limited – the
> scaling argument
> 

Another data point from the "The "performance penalty of a complete
protocol stack has always been overstated (and overrated)" bullet above:

"As a side argument – a performance comparison made in 1998 showed SCSI
over TCP (a predecessor of the later iSCSI) to perform better than FCP
at 1Gbs for block sizes typical for OLTP (4-8KB). That was w

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 17:00 -0600, James Bottomley wrote:
> On Mon, 2008-02-04 at 22:43 +, Alan Cox wrote:
> > > better. So for example, I personally suspect that ATA-over-ethernet is 
> > > way 
> > > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
> > > low-level, and against those crazy SCSI people to begin with.
> > 
> > Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
> > would probably trash iSCSI for latency if nothing else.
> 
> Actually, there's also FCoE now ... which is essentially SCSI
> encapsulated in Fibre Channel Protocols (FCP) running over ethernet with
> Jumbo frames.  It does the standard SCSI TCQ, so should answer all the
> latency pieces.  Intel even has an implementation:
> 
> http://www.open-fcoe.org/
> 
> I tend to prefer the low levels as well.  The whole disadvantage for IP
> as regards iSCSI was the layers of protocols on top of it for
> addressing, authenticating, encrypting and finding any iSCSI device
> anywhere in the connected universe.

Btw, while simple in-band discovery of iSCSI exists, the standards based
IP storage deployments (iSCSI and iFCP) use iSNS (RFC-4171) for
discovery and network fabric management, for things like sending state
change notifications when a particular network portal is going away so
that the initiator can bring up a different communication patch to a
different network portal, etc.

> 
> I tend to see loss of routing from operating at the MAC level to be a
> nicely justifiable tradeoff (most storage networks tend to be hubbed or
> switched anyway).  Plus an ethernet MAC with jumbo frames is a large
> framed nearly lossless medium, which is practically what FCP is
> expecting.  If you really have to connect large remote sites ... well
> that's what tunnelling bridges are for.
> 

Some of the points by Julo on the IPS TWG iSCSI vs. FCoE thread:

  * the network is limited in physical span and logical span (number
of switches)
  * flow-control/congestion control is achieved with a mechanism
adequate for a limited span network (credits). The packet loss
rate is almost nil and that allows FCP to avoid using a
transport (end-to-end) layer
  * FCP she switches are simple (addresses are local and the memory
requirements cam be limited through the credit mechanism)
  * The credit mechanisms is highly unstable for large networks
(check switch vendors planning docs for the network diameter
limits) – the scaling argument
  * Ethernet has no credit mechanism and any mechanism with a
similar effect increases the end point cost. Building a
transport layer in the protocol stack has always been the
preferred choice of the networking community – the community
argument
  * The "performance penalty" of a complete protocol stack has
always been overstated (and overrated). Advances in protocol
stack implementation and finer tuning of the congestion control
mechanisms make conventional TCP/IP performing well even at 10
Gb/s and over. Moreover the multicore processors that become
dominant on the computing scene have enough compute cycles
available to make any "offloading" possible as a mere code
restructuring exercise (see the stack reports from Intel, IBM
etc.)
  * Building on a complete stack makes available a wealth of
operational and management mechanisms built over the years by
the networking community (routing, provisioning, security,
service location etc.) – the community argument
  * Higher level storage access over an IP network is widely
available and having both block and file served over the same
connection with the same support and management structure is
compelling– the community argument
  * Highly efficient networks are easy to build over IP with optimal
(shortest path) routing while Layer 2 networks use bridging and
are limited by the logical tree structure that bridges must
follow. The effort to combine routers and bridges (rbridges) is
promising to change that but it will take some time to finalize
(and we don't know exactly how it will operate). Untill then the
scale of Layer 2 network is going to seriously limited – the
scaling argument

Perhaps it would be of worth to get some more linux-net guys in on the
discussion.  :-)

--nab

> James
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Alan Cox wrote:
better. So for example, I personally suspect that ATA-over-ethernet is way 
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
low-level, and against those crazy SCSI people to begin with.


Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
would probably trash iSCSI for latency if nothing else.


AoE is truly a thing of beauty.  It has a two/three page RFC (say no more!).

But quite so...  AoE is limited to MTU size, which really hurts.  Can't 
really do tagged queueing, etc.



iSCSI is way, way too complicated.  It's an Internet protocol designed 
by storage designers, what do you expect?


For years I have been hoping that someone will invent a simple protocol 
(w/ strong auth) that can transit ATA and SCSI commands and responses. 
Heck, it would be almost trivial if the kernel had a TLS/SSL implementation.


Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 22:43 +, Alan Cox wrote:
> > better. So for example, I personally suspect that ATA-over-ethernet is way 
> > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
> > low-level, and against those crazy SCSI people to begin with.
> 
> Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
> would probably trash iSCSI for latency if nothing else.

Actually, there's also FCoE now ... which is essentially SCSI
encapsulated in Fibre Channel Protocols (FCP) running over ethernet with
Jumbo frames.  It does the standard SCSI TCQ, so should answer all the
latency pieces.  Intel even has an implementation:

http://www.open-fcoe.org/

I tend to prefer the low levels as well.  The whole disadvantage for IP
as regards iSCSI was the layers of protocols on top of it for
addressing, authenticating, encrypting and finding any iSCSI device
anywhere in the connected universe.

I tend to see loss of routing from operating at the MAC level to be a
nicely justifiable tradeoff (most storage networks tend to be hubbed or
switched anyway).  Plus an ethernet MAC with jumbo frames is a large
framed nearly lossless medium, which is practically what FCP is
expecting.  If you really have to connect large remote sites ... well
that's what tunnelling bridges are for.

James

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 22:43 +, Alan Cox wrote:
> > better. So for example, I personally suspect that ATA-over-ethernet is way 
> > better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
> > low-level, and against those crazy SCSI people to begin with.
> 
> Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
> would probably trash iSCSI for latency if nothing else.
> 

In the previous iSCSI vs. FCoE points (here is the link again):

http://www.ietf.org/mail-archive/web/ips/current/msg02325.html

the latency discussion is the one bit that is not mentioned.  I always
assumed that back then (as with today) the biggest issue was getting
ethernet hardware, espically switching equipment down to the sub
millisecond latency, and on par with what you would expect from 'real
RDMA' hardware.  In lowest of the low, say sub 10 ns latency, which is
apparently possible with point to point on high-end 10 Gb/sec adapters
today, it would be really interesting to know how much more latency
would be expected between software iSCSI vs. *oE when we work our way
back up the networking stack.

Julo, do you have any idea on this..?

--nab

> 
> Alan
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Linus Torvalds wrote:
So no, performance is not the only reason to move to kernel space. It can 
easily be things like needing direct access to internal data queues (for a 
iSCSI target, this could be things like barriers or just tagged commands - 
yes, you can probably emulate things like that without access to the 
actual IO queues, but are you sure the semantics will be entirely right?


The kernel/userland boundary is not just a performance boundary, it's an 
abstraction boundary too, and these kinds of protocols tend to break 
abstractions. NFS broke it by having "file handles" (which is not 
something that really exists in user space, and is almost impossible to 
emulate correctly), and I bet the same thing happens when emulating a SCSI 
target in user space.


Well, speaking as a complete nutter who just finished the bare bones of 
an NFSv4 userland server[1]...  it depends on your approach.


If the userland server is the _only_ one accessing the data[2] -- i.e. 
the database server model where ls(1) shows a couple multi-gigabyte 
files or a raw partition -- then it's easy to get all the semantics 
right, including file handles.  You're not racing with local kernel 
fileserving.


Couple that with sendfile(2), sync_file_range(2) and a few other 
Linux-specific syscalls, and you've got an efficient NFS file server.


It becomes a solution similar to Apache or MySQL or Oracle.


I quite grant there are many good reasons to do NFS or iSCSI data path 
in the kernel...  my point is more that "impossible" is just from one 
point of view ;-)



Maybe not. I _rally_ haven't looked into iSCSI, I'm just guessing there 
would be things like ordering issues.


iSCSI and NBD were passe ideas at birth.  :)

Networked block devices are attractive because the concepts and 
implementation are more simple than networked filesystems... but usually 
you want to run some sort of filesystem on top.  At that point you might 
as well run NFS or [gfs|ocfs|flavor-of-the-week], and ditch your 
networked block device (and associated complexity).


iSCSI is barely useful, because at least someone finally standardized 
SCSI over LAN/WAN.


But you just don't need its complexity if your filesystem must have its 
own authentication, distributed coordination, multiple-connection 
management code of its own.


Jeff


P.S.  Clearly my NFSv4 server is NOT intended to replace the kernel one. 
 It's more for experiments, and doing FUSE-like filesystem work.



[1] http://linux.yyz.us/projects/nfsv4.html

[2] well, outside of dd(1) and similar tricks... the same "going around 
its back" tricks that can screw up a mounted filesystem.


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Alan Cox

> better. So for example, I personally suspect that ATA-over-ethernet is way 
> better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
> low-level, and against those crazy SCSI people to begin with.

Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
would probably trash iSCSI for latency if nothing else.


Alan
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 13:24 -0800, Linus Torvalds wrote:
> 
> On Mon, 4 Feb 2008, J. Bruce Fields wrote:
> > 
> > I'd assumed the move was primarily because of the difficulty of getting
> > correct semantics on a shared filesystem
> 
> .. not even shared. It was hard to get correct semantics full stop. 
> 
> Which is a traditional problem. The thing is, the kernel always has some 
> internal state, and it's hard to expose all the semantics that the kernel 
> knows about to user space.
> 
> So no, performance is not the only reason to move to kernel space. It can 
> easily be things like needing direct access to internal data queues (for a 
> iSCSI target, this could be things like barriers or just tagged commands - 
> yes, you can probably emulate things like that without access to the 
> actual IO queues, but are you sure the semantics will be entirely right?
> 
> The kernel/userland boundary is not just a performance boundary, it's an 
> abstraction boundary too, and these kinds of protocols tend to break 
> abstractions. NFS broke it by having "file handles" (which is not 
> something that really exists in user space, and is almost impossible to 
> emulate correctly), and I bet the same thing happens when emulating a SCSI 
> target in user space.
> 
> Maybe not. I _rally_ haven't looked into iSCSI, I'm just guessing there 
> would be things like ordering issues.
> 

.

The iSCSI CDBs and write immediate, unsoliciated, or soliciated data
payloads may be received out of order across communication paths (which
may be going over different subnets) within the nexus, but the execution
of the CDB to SCSI Target Port must be in the same order as it came down
from the SCSI subsystem on the initiator port.  In iSCSI and iSER terms,
this is called Command Sequence Number (CmdSN) ordering, and is enforced
within each nexus.  The initiator node will be assigning the CmdSNs as
the CDBs come down, and when communication paths fail, unacknowledged
CmdSNs will be retried on a different communication path when using
iSCSI/iSER connection recovery.  Already acknowledged CmdSNs will be
explictly retried using a iSCSI specific task management function called
TASK_REASSIGN.  This along with CSM-I and CSM-E statemachines are
collectly known as ErrorRecoveryLevel=2 in iSCSI.

Anyways, here is a great visual of a modern iSCSI Target processor and
SCSI Target Engine.  The CmdSN ordering is representd by the oval across
across iSCSI connections going to various network portals groups on the
left side of the diagram.  Thanks Eddy Q!

http://www.haifa.il.ibm.com/satran/ips/EddyQuicksall-iSCSI-in-diagrams/portal_groups.pdf

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread J. Bruce Fields

On Mon, Feb 04, 2008 at 11:44:31AM -0800, Linus Torvalds wrote:
...
> Pure user-space solutions work, but tend to eventually be turned into 
> kernel-space if they are simple enough and really do have throughput and 
> latency considerations (eg nfsd), and aren't quite complex and crazy 
> enough to have a large impedance-matching problem even for basic IO stuff 
> (eg samba).
...
> So just going by what has happened in the past, I'd assume that iSCSI 
> would eventually turn into "connecting/authentication in user space" with 
> "data transfers in kernel space". But only if it really does end up 
> mattering enough. We had a totally user-space NFS daemon for a long time, 
> and it was perfectly fine until people really started caring.

I'd assumed the move was primarily because of the difficulty of getting
correct semantics on a shared filesystem--if you're content with
NFS-only access to your filesystem, then you can probably do everything
in userspace, but once you start worrying about getting stable
filehandles, consistent file locking, etc., from a real disk filesystem
with local users, then you require much closer cooperation from the
kernel.

And I seem to recall being told that sort of thing was the motivation
more than performance, but I wasn't there (and I haven't seen
performance comparisons).

--b.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 4 Feb 2008, J. Bruce Fields wrote:
> 
> I'd assumed the move was primarily because of the difficulty of getting
> correct semantics on a shared filesystem

.. not even shared. It was hard to get correct semantics full stop. 

Which is a traditional problem. The thing is, the kernel always has some 
internal state, and it's hard to expose all the semantics that the kernel 
knows about to user space.

So no, performance is not the only reason to move to kernel space. It can 
easily be things like needing direct access to internal data queues (for a 
iSCSI target, this could be things like barriers or just tagged commands - 
yes, you can probably emulate things like that without access to the 
actual IO queues, but are you sure the semantics will be entirely right?

The kernel/userland boundary is not just a performance boundary, it's an 
abstraction boundary too, and these kinds of protocols tend to break 
abstractions. NFS broke it by having "file handles" (which is not 
something that really exists in user space, and is almost impossible to 
emulate correctly), and I bet the same thing happens when emulating a SCSI 
target in user space.

Maybe not. I _rally_ haven't looked into iSCSI, I'm just guessing there 
would be things like ordering issues.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 11:44 -0800, Linus Torvalds wrote:
> 
> On Mon, 4 Feb 2008, Nicholas A. Bellinger wrote:
> > 
> > While this does not have anything to do directly with the kernel vs. 
> > user discussion for target mode storage engine, the scaling and latency 
> > case is easy enough to make if we are talking about scaling TCP for 10 
> > Gb/sec storage fabrics.
> 
> I would like to point out that while I think there is no question that the 
> basic data transfer engine would perform better in kernel space, there 
> stll *are* questions whether
> 
>  - iSCSI is relevant enough for us to even care ...
> 
>  - ... and the complexity is actually worth it.
> 
> That said, I also tend to believe that trying to split things up between 
> kernel and user space is often more complex than just keeping things in 
> one place, because the trade-offs of which part goes where wll inevitably 
> be wrong in *some* area, and then you're really screwed.
> 
> So from a purely personal standpoint, I'd like to say that I'm not really 
> interested in iSCSI (and I don't quite know why I've been cc'd on this 
> whole discussion)

The generic target mode storage engine discussion quickly goes to
transport specific scenarios.  With so much interest in the SCSI
transports, in particuarly iSCSI, there are lots of devs, users, and
vendors who would like to see Linux improve in this respect.

>  and think that other approaches are potentially *much* 
> better. So for example, I personally suspect that ATA-over-ethernet is way 
> better than some crazy SCSI-over-TCP crap,

Having the non SCSI target mode transports use the same data IO path as
the SCSI ones to SCSI, BIO, and FILE subsystems is something that can
easily be agreed on.  Also having to emulate the non SCSI control paths
in a non generic matter to a target mode engine has to suck (I don't
know what AoE does for that now, considering that this is going down to
libata or real SCSI hardware in some cases.  There are some of the more
arcane task management functionality in SCSI (ACA anyone?) that even
generic SCSI target mode engines do not use, and only seem to make
endlessly complex implement and emulate.

But aside from those very SCSI hardware specific cases, having a generic
method to use something like ABORT_TASK or LUN_RESET for a target mode
engine (along with the data path to all of the subsystems) would be
beneficial for any fabric.

> but I'm biased for simple and 
> low-level, and against those crazy SCSI people to begin with.

Well, having no obvious preconception (well, aside from the email
address), I am of the mindset than the iSCSI people are the LEAST crazy
said crazy SCSI people.  Some people (usually least crazy iSCSI
standards folks) say that FCoE people are crazy.  Being one of the iSCSI
people I am kinda obligated to agree, but the technical points are
really solid, and have been so for over a decade.  They are listed here
for those who are interested:

http://www.ietf.org/mail-archive/web/ips/current/msg02325.html

> 
> So take any utterances of mine with a big pinch of salt.
> 
> Historically, the only split that has worked pretty well is "connection 
> initiation/setup in user space, actual data transfers in kernel space". 
> 
> Pure user-space solutions work, but tend to eventually be turned into 
> kernel-space if they are simple enough and really do have throughput and 
> latency considerations (eg nfsd), and aren't quite complex and crazy 
> enough to have a large impedance-matching problem even for basic IO stuff 
> (eg samba).
> 
> And totally pure kernel solutions work only if there are very stable 
> standards and no major authentication or connection setup issues (eg local 
> disks).
> 
> So just going by what has happened in the past, I'd assume that iSCSI 
> would eventually turn into "connecting/authentication in user space" with 
> "data transfers in kernel space". But only if it really does end up 
> mattering enough. We had a totally user-space NFS daemon for a long time, 
> and it was perfectly fine until people really started caring.

Thanks for putting this into an historical perspective.  Also it is
interesting to note that the iSCSI spec (RFC-3720) was ratified in April
2004, so it will be going on 4 years soon, which pre-RFC products first
going out in 2001 (yikes!).  In my experience, the iSCSI interopt
amongst implementations (espically between different OSes) has been
stable since about late 2004, early 2005, with interopt between OS SCSI
subsystems (espically talking to non SCSI hardware) being the slower of
the two.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread 4news

On lunedì 4 febbraio 2008, Linus Torvalds wrote:
> So from a purely personal standpoint, I'd like to say that I'm not really
> interested in iSCSI (and I don't quite know why I've been cc'd on this
> whole discussion) and think that other approaches are potentially *much*
> better. So for example, I personally suspect that ATA-over-ethernet is way
> better than some crazy SCSI-over-TCP crap, but I'm biased for simple and
> low-level, and against those crazy SCSI people to begin with.

surely aoe is better than iscsi almost on performance because of the lesser 
protocol stack:
iscsi ->  scsi - ip - eth
aoe -> ata - eth

but surely iscsi is more a standard than aoe and is more actively used by 
real-world .

Other really useful feature are that:
- iscsi is capable to move to a ip based san scsi devices by routing that ( 
i've some tape changer routed by scst to some system that don't have other 
way to see a tape).
- because it work on the ip layer it can be routed between long distance , so 
having needed bandwidth you can have a really remote block device spoking a 
standard protocol between non ethereogenus systems.
- iscsi is now the cheapest san avaible.

bye,
marco.

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 4 Feb 2008, Nicholas A. Bellinger wrote:
> 
> While this does not have anything to do directly with the kernel vs. 
> user discussion for target mode storage engine, the scaling and latency 
> case is easy enough to make if we are talking about scaling TCP for 10 
> Gb/sec storage fabrics.

I would like to point out that while I think there is no question that the 
basic data transfer engine would perform better in kernel space, there 
stll *are* questions whether

 - iSCSI is relevant enough for us to even care ...

 - ... and the complexity is actually worth it.

That said, I also tend to believe that trying to split things up between 
kernel and user space is often more complex than just keeping things in 
one place, because the trade-offs of which part goes where wll inevitably 
be wrong in *some* area, and then you're really screwed.

So from a purely personal standpoint, I'd like to say that I'm not really 
interested in iSCSI (and I don't quite know why I've been cc'd on this 
whole discussion) and think that other approaches are potentially *much* 
better. So for example, I personally suspect that ATA-over-ethernet is way 
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
low-level, and against those crazy SCSI people to begin with.

So take any utterances of mine with a big pinch of salt.

Historically, the only split that has worked pretty well is "connection 
initiation/setup in user space, actual data transfers in kernel space". 

Pure user-space solutions work, but tend to eventually be turned into 
kernel-space if they are simple enough and really do have throughput and 
latency considerations (eg nfsd), and aren't quite complex and crazy 
enough to have a large impedance-matching problem even for basic IO stuff 
(eg samba).

And totally pure kernel solutions work only if there are very stable 
standards and no major authentication or connection setup issues (eg local 
disks).

So just going by what has happened in the past, I'd assume that iSCSI 
would eventually turn into "connecting/authentication in user space" with 
"data transfers in kernel space". But only if it really does end up 
mattering enough. We had a totally user-space NFS daemon for a long time, 
and it was perfectly fine until people really started caring.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 11:06 -0800, Nicholas A. Bellinger wrote:
> On Mon, 2008-02-04 at 10:29 -0800, Linus Torvalds wrote:
> > 
> > On Mon, 4 Feb 2008, James Bottomley wrote:
> > > 
> > > The way a user space solution should work is to schedule mmapped I/O
> > > from the backing store and then send this mmapped region off for target
> > > I/O.
> > 
> > mmap'ing may avoid the copy, but the overhead of a mmap operation is 
> > quite often much *bigger* than the overhead of a copy operation.
> > 
> > Please do not advocate the use of mmap() as a way to avoid memory copies. 
> > It's not realistic. Even if you can do it with a single "mmap()" system 
> > call (which is not at all a given, considering that block devices can 
> > easily be much larger than the available virtual memory space), the fact 
> > is that page table games along with the fault (and even just TLB miss) 
> > overhead is easily more than the cost of copying a page in a nice 
> > streaming manner.
> > 
> > Yes, memory is "slow", but dammit, so is mmap().
> > 
> > > You also have to pull tricks with the mmap region in the case of writes 
> > > to prevent useless data being read in from the backing store.  However, 
> > > none of this involves data copies.
> > 
> > "data copies" is irrelevant. The only thing that matters is performance. 
> > And if avoiding data copies is more costly (or even of a similar cost) 
> > than the copies themselves would have been, there is absolutely no upside, 
> > and only downsides due to extra complexity.
> > 
> 
> The iSER spec (RFC-5046) quotes the following in the TCP case for direct
> data placement:
> 
> "  Out-of-order TCP segments in the Traditional iSCSI model have to be
>stored and reassembled before the iSCSI protocol layer within an end
>node can place the data in the iSCSI buffers.  This reassembly is
>required because not every TCP segment is likely to contain an iSCSI
>header to enable its placement, and TCP itself does not have a
>built-in mechanism for signaling Upper Level Protocol (ULP) message
>boundaries to aid placement of out-of-order segments.  This TCP
>reassembly at high network speeds is quite counter-productive for the
>following reasons: wasted memory bandwidth in data copying, the need
>for reassembly memory, wasted CPU cycles in data copying, and the
>general store-and-forward latency from an application perspective."
> 
> While this does not have anything to do directly with the kernel vs. user 
> discussion
> for target mode storage engine, the scaling and latency case is easy enough
> to make if we are talking about scaling TCP for 10 Gb/sec storage fabrics.
> 
> > If you want good performance for a service like this, you really generally 
> > *do* need to in kernel space. You can play games in user space, but you're 
> > fooling yourself if you think you can do as well as doing it in the 
> > kernel. And you're *definitely* fooling yourself if you think mmap() 
> > solves performance issues. "Zero-copy" does not equate to "fast". Memory 
> > speeds may be slower that core CPU speeds, but not infinitely so!
> > 
> 
> >From looking at this problem from a kernel space perspective for a
> number of years, I would be inclined to believe this is true for
> software and hardware data-path cases.  The benefits of moving various
> control statemachines for something like say traditional iSCSI to
> userspace has always been debateable.  The most obvious ones are things
> like authentication, espically if something more complex than CHAP are
> the obvious case for userspace.  However, I have thought recovery for
> failures caused from communication path (iSCSI connections) or entire
> nexuses (iSCSI sessions) failures was very problematic to expect to have
> to potentially push down IOs state to userspace.
> 
> Keeping statemachines for protocol and/or fabric specific statemachines
> (CSM-E and CSM-I from connection recovery in iSCSI and iSER are the
> obvious ones) are the best canidates for residing in kernel space.
> 
> > (That said: there *are* alternatives to mmap, like "splice()", that really 
> > do potentially solve some issues without the page table and TLB overheads. 
> > But while splice() avoids the costs of paging, I strongly suspect it would 
> > still have easily measurable latency issues. Switching between user and 
> > kernel space multiple times is definitely not going to be free, although 
> > it's probably not a huge issue if you have big enough requests).
> > 
> 

Then again, having some data-path for software and hardware bulk IO
operation of storage fabric protocol / statemachine in userspace would
be really interesting for something like an SPU enabled engine for the
Cell Broadband Architecture.

--nab



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 10:29 -0800, Linus Torvalds wrote:
> 
> On Mon, 4 Feb 2008, James Bottomley wrote:
> > 
> > The way a user space solution should work is to schedule mmapped I/O
> > from the backing store and then send this mmapped region off for target
> > I/O.
> 
> mmap'ing may avoid the copy, but the overhead of a mmap operation is 
> quite often much *bigger* than the overhead of a copy operation.
> 
> Please do not advocate the use of mmap() as a way to avoid memory copies. 
> It's not realistic. Even if you can do it with a single "mmap()" system 
> call (which is not at all a given, considering that block devices can 
> easily be much larger than the available virtual memory space), the fact 
> is that page table games along with the fault (and even just TLB miss) 
> overhead is easily more than the cost of copying a page in a nice 
> streaming manner.
> 
> Yes, memory is "slow", but dammit, so is mmap().
> 
> > You also have to pull tricks with the mmap region in the case of writes 
> > to prevent useless data being read in from the backing store.  However, 
> > none of this involves data copies.
> 
> "data copies" is irrelevant. The only thing that matters is performance. 
> And if avoiding data copies is more costly (or even of a similar cost) 
> than the copies themselves would have been, there is absolutely no upside, 
> and only downsides due to extra complexity.
> 

The iSER spec (RFC-5046) quotes the following in the TCP case for direct
data placement:

"  Out-of-order TCP segments in the Traditional iSCSI model have to be
   stored and reassembled before the iSCSI protocol layer within an end
   node can place the data in the iSCSI buffers.  This reassembly is
   required because not every TCP segment is likely to contain an iSCSI
   header to enable its placement, and TCP itself does not have a
   built-in mechanism for signaling Upper Level Protocol (ULP) message
   boundaries to aid placement of out-of-order segments.  This TCP
   reassembly at high network speeds is quite counter-productive for the
   following reasons: wasted memory bandwidth in data copying, the need
   for reassembly memory, wasted CPU cycles in data copying, and the
   general store-and-forward latency from an application perspective."

While this does not have anything to do directly with the kernel vs. user 
discussion
for target mode storage engine, the scaling and latency case is easy enough
to make if we are talking about scaling TCP for 10 Gb/sec storage fabrics.

> If you want good performance for a service like this, you really generally 
> *do* need to in kernel space. You can play games in user space, but you're 
> fooling yourself if you think you can do as well as doing it in the 
> kernel. And you're *definitely* fooling yourself if you think mmap() 
> solves performance issues. "Zero-copy" does not equate to "fast". Memory 
> speeds may be slower that core CPU speeds, but not infinitely so!
> 

>From looking at this problem from a kernel space perspective for a
number of years, I would be inclined to believe this is true for
software and hardware data-path cases.  The benefits of moving various
control statemachines for something like say traditional iSCSI to
userspace has always been debateable.  The most obvious ones are things
like authentication, espically if something more complex than CHAP are
the obvious case for userspace.  However, I have thought recovery for
failures caused from communication path (iSCSI connections) or entire
nexuses (iSCSI sessions) failures was very problematic to expect to have
to potentially push down IOs state to userspace.

Keeping statemachines for protocol and/or fabric specific statemachines
(CSM-E and CSM-I from connection recovery in iSCSI and iSER are the
obvious ones) are the best canidates for residing in kernel space.

> (That said: there *are* alternatives to mmap, like "splice()", that really 
> do potentially solve some issues without the page table and TLB overheads. 
> But while splice() avoids the costs of paging, I strongly suspect it would 
> still have easily measurable latency issues. Switching between user and 
> kernel space multiple times is definitely not going to be free, although 
> it's probably not a huge issue if you have big enough requests).
> 

Most of the SCSI OS storage subsystems that I have worked with in the
context of iSCSI have used 256 * 512 byte setctor requests, which the
default traditional iSCSI PDU data payload (MRDSL) being 64k to hit the
sweet spot with crc32c checksum calculations.  I am assuming this is
going to be the case for other fabrics as well.

--nab

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 21:38 +0300, Vladislav Bolkhovitin wrote:
> James Bottomley wrote:
> > On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote:
> > 
> >>James Bottomley wrote:
> >>
> >>>On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:
> >>>
> >>>
> James Bottomley wrote:
> 
> 
> So, James, what is your opinion on the above? Or the overall SCSI 
> target 
> project simplicity doesn't matter much for you and you think it's 
> fine 
> to duplicate Linux page cache in the user space to keep the in-kernel 
> part of the project as small as possible?
> >>>
> >>>
> >>>The answers were pretty much contained here
> >>>
> >>>http://marc.info/?l=linux-scsi&m=120164008302435
> >>>
> >>>and here:
> >>>
> >>>http://marc.info/?l=linux-scsi&m=120171067107293
> >>>
> >>>Weren't they?
> >>
> >>No, sorry, it doesn't look so for me. They are about performance, but 
> >>I'm asking about the overall project's architecture, namely about one 
> >>part of it: simplicity. Particularly, what do you think about 
> >>duplicating Linux page cache in the user space to have zero-copy cached 
> >>I/O? Or can you suggest another architectural solution for that problem 
> >>in the STGT's approach?
> >
> >
> >Isn't that an advantage of a user space solution?  It simply uses the
> >backing store of whatever device supplies the data.  That means it takes
> >advantage of the existing mechanisms for caching.
> 
> No, please reread this thread, especially this message: 
> http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of 
> the advantages of the kernel space implementation. The user space 
> implementation has to have data copied between the cache and user space 
> buffer, but the kernel space one can use pages in the cache directly, 
> without extra copy.
> >>>
> >>>
> >>>Well, you've said it thrice (the bellman cried) but that doesn't make it
> >>>true.
> >>>
> >>>The way a user space solution should work is to schedule mmapped I/O
> >>>from the backing store and then send this mmapped region off for target
> >>>I/O.  For reads, the page gather will ensure that the pages are up to
> >>>date from the backing store to the cache before sending the I/O out.
> >>>For writes, You actually have to do a msync on the region to get the
> >>>data secured to the backing store. 
> >>
> >>James, have you checked how fast is mmaped I/O if work size > size of 
> >>RAM? It's several times slower comparing to buffered I/O. It was many 
> >>times discussed in LKML and, seems, VM people consider it unavoidable. 
> > 
> > 
> > Erm, but if you're using the case of work size > size of RAM, you'll
> > find buffered I/O won't help because you don't have the memory for
> > buffers either.
> 
> James, just check and you will see, buffered I/O is a lot faster.

So in an out of memory situation the buffers you don't have are a lot
faster than the pages I don't have?

> >>So, using mmaped IO isn't an option for high performance. Plus, mmaped 
> >>IO isn't an option for high reliability requirements, since it doesn't 
> >>provide a practical way to handle I/O errors.
> > 
> > I think you'll find it does ... the page gather returns -EFAULT if
> > there's an I/O error in the gathered region. 
> 
> Err, to whom return? If you try to read from a mmaped page, which can't 
> be populated due to I/O error, you will get SIGBUS or SIGSEGV, I don't 
> remember exactly. It's quite tricky to get back to the faulted command 
> from the signal handler.
> 
> Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you 
> think that such mapping/unmapping is good for performance?
> 
> > msync does something
> > similar if there's a write failure.
> > 
> >>>You also have to pull tricks with
> >>>the mmap region in the case of writes to prevent useless data being read
> >>>in from the backing store.
> >>
> >>Can you be more exact and specify what kind of tricks should be done for 
> >>that?
> > 
> > Actually, just avoid touching it seems to do the trick with a recent
> > kernel.
> 
> Hmm, how can one write to an mmaped page and don't touch it?

I meant from user space ... the writes are done inside the kernel.

However, as Linus has pointed out, this discussion is getting a bit off
topic.  There's no actual evidence that copy problems are causing any
performatince issues issues for STGT.  In fact, there's evidence that
they're not for everything except IB networks.

James


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 10:29 -0800, Linus Torvalds wrote:
> 
> On Mon, 4 Feb 2008, James Bottomley wrote:
> > 
> > The way a user space solution should work is to schedule mmapped I/O
> > from the backing store and then send this mmapped region off for target
> > I/O.
> 
> mmap'ing may avoid the copy, but the overhead of a mmap operation is 
> quite often much *bigger* than the overhead of a copy operation.
> 
> Please do not advocate the use of mmap() as a way to avoid memory copies. 
> It's not realistic. Even if you can do it with a single "mmap()" system 
> call (which is not at all a given, considering that block devices can 
> easily be much larger than the available virtual memory space), the fact 
> is that page table games along with the fault (and even just TLB miss) 
> overhead is easily more than the cost of copying a page in a nice 
> streaming manner.
> 
> Yes, memory is "slow", but dammit, so is mmap().
> 
> > You also have to pull tricks with the mmap region in the case of writes 
> > to prevent useless data being read in from the backing store.  However, 
> > none of this involves data copies.
> 
> "data copies" is irrelevant. The only thing that matters is performance. 
> And if avoiding data copies is more costly (or even of a similar cost) 
> than the copies themselves would have been, there is absolutely no upside, 
> and only downsides due to extra complexity.
> 
> If you want good performance for a service like this, you really generally 
> *do* need to in kernel space. You can play games in user space, but you're 
> fooling yourself if you think you can do as well as doing it in the 
> kernel. And you're *definitely* fooling yourself if you think mmap() 
> solves performance issues. "Zero-copy" does not equate to "fast". Memory 
> speeds may be slower that core CPU speeds, but not infinitely so!
> 
> (That said: there *are* alternatives to mmap, like "splice()", that really 
> do potentially solve some issues without the page table and TLB overheads. 
> But while splice() avoids the costs of paging, I strongly suspect it would 
> still have easily measurable latency issues. Switching between user and 
> kernel space multiple times is definitely not going to be free, although 
> it's probably not a huge issue if you have big enough requests).

Sorry ... this is really just a discussion of how something (zero copy)
could be done, rather than an implementation proposal.  (I'm not
actually planning to make the STGT people do anything ... although
investigating splice does sound interesting).

Right at the moment, STGT seems to be performing just fine on
measurements up to gigabit networks.  There are suggestions that there
may be a problem on 8G IB networks, but it's not definitive yet.

I'm already on record as saying I think the best fix for IB networks is
just to reduce the context switches by increasing the transfer size, but
the infrastructure to allow that only just went into git head.

James


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


James Bottomley wrote:

On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote:


James Bottomley wrote:


On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:



James Bottomley wrote:


So, James, what is your opinion on the above? Or the overall SCSI target 
project simplicity doesn't matter much for you and you think it's fine 
to duplicate Linux page cache in the user space to keep the in-kernel 
part of the project as small as possible?



The answers were pretty much contained here

http://marc.info/?l=linux-scsi&m=120164008302435

and here:

http://marc.info/?l=linux-scsi&m=120171067107293

Weren't they?


No, sorry, it doesn't look so for me. They are about performance, but 
I'm asking about the overall project's architecture, namely about one 
part of it: simplicity. Particularly, what do you think about 
duplicating Linux page cache in the user space to have zero-copy cached 
I/O? Or can you suggest another architectural solution for that problem 
in the STGT's approach?



Isn't that an advantage of a user space solution?  It simply uses the
backing store of whatever device supplies the data.  That means it takes
advantage of the existing mechanisms for caching.


No, please reread this thread, especially this message: 
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of 
the advantages of the kernel space implementation. The user space 
implementation has to have data copied between the cache and user space 
buffer, but the kernel space one can use pages in the cache directly, 
without extra copy.



Well, you've said it thrice (the bellman cried) but that doesn't make it
true.

The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.  For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, You actually have to do a msync on the region to get the
data secured to the backing store. 


James, have you checked how fast is mmaped I/O if work size > size of 
RAM? It's several times slower comparing to buffered I/O. It was many 
times discussed in LKML and, seems, VM people consider it unavoidable. 



Erm, but if you're using the case of work size > size of RAM, you'll
find buffered I/O won't help because you don't have the memory for
buffers either.


James, just check and you will see, buffered I/O is a lot faster.

So, using mmaped IO isn't an option for high performance. Plus, mmaped 
IO isn't an option for high reliability requirements, since it doesn't 
provide a practical way to handle I/O errors.


I think you'll find it does ... the page gather returns -EFAULT if
there's an I/O error in the gathered region. 


Err, to whom return? If you try to read from a mmaped page, which can't 
be populated due to I/O error, you will get SIGBUS or SIGSEGV, I don't 
remember exactly. It's quite tricky to get back to the faulted command 
from the signal handler.


Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you 
think that such mapping/unmapping is good for performance?



msync does something
similar if there's a write failure.


You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.


Can you be more exact and specify what kind of tricks should be done for 
that?


Actually, just avoid touching it seems to do the trick with a recent
kernel.


Hmm, how can one write to an mmaped page and don't touch it?


James





-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 4 Feb 2008, James Bottomley wrote:
> 
> The way a user space solution should work is to schedule mmapped I/O
> from the backing store and then send this mmapped region off for target
> I/O.

mmap'ing may avoid the copy, but the overhead of a mmap operation is 
quite often much *bigger* than the overhead of a copy operation.

Please do not advocate the use of mmap() as a way to avoid memory copies. 
It's not realistic. Even if you can do it with a single "mmap()" system 
call (which is not at all a given, considering that block devices can 
easily be much larger than the available virtual memory space), the fact 
is that page table games along with the fault (and even just TLB miss) 
overhead is easily more than the cost of copying a page in a nice 
streaming manner.

Yes, memory is "slow", but dammit, so is mmap().

> You also have to pull tricks with the mmap region in the case of writes 
> to prevent useless data being read in from the backing store.  However, 
> none of this involves data copies.

"data copies" is irrelevant. The only thing that matters is performance. 
And if avoiding data copies is more costly (or even of a similar cost) 
than the copies themselves would have been, there is absolutely no upside, 
and only downsides due to extra complexity.

If you want good performance for a service like this, you really generally 
*do* need to in kernel space. You can play games in user space, but you're 
fooling yourself if you think you can do as well as doing it in the 
kernel. And you're *definitely* fooling yourself if you think mmap() 
solves performance issues. "Zero-copy" does not equate to "fast". Memory 
speeds may be slower that core CPU speeds, but not infinitely so!

(That said: there *are* alternatives to mmap, like "splice()", that really 
do potentially solve some issues without the page table and TLB overheads. 
But while splice() avoids the costs of paging, I strongly suspect it would 
still have easily measurable latency issues. Switching between user and 
kernel space multiple times is definitely not going to be free, although 
it's probably not a huge issue if you have big enough requests).

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel

On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote:
> James Bottomley wrote:
> > On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:
> > 
> >>James Bottomley wrote:
> >>
> >>So, James, what is your opinion on the above? Or the overall SCSI 
> >>target 
> >>project simplicity doesn't matter much for you and you think it's fine 
> >>to duplicate Linux page cache in the user space to keep the in-kernel 
> >>part of the project as small as possible?
> >
> >
> >The answers were pretty much contained here
> >
> >http://marc.info/?l=linux-scsi&m=120164008302435
> >
> >and here:
> >
> >http://marc.info/?l=linux-scsi&m=120171067107293
> >
> >Weren't they?
> 
> No, sorry, it doesn't look so for me. They are about performance, but 
> I'm asking about the overall project's architecture, namely about one 
> part of it: simplicity. Particularly, what do you think about 
> duplicating Linux page cache in the user space to have zero-copy cached 
> I/O? Or can you suggest another architectural solution for that problem 
> in the STGT's approach?
> >>>
> >>>
> >>>Isn't that an advantage of a user space solution?  It simply uses the
> >>>backing store of whatever device supplies the data.  That means it takes
> >>>advantage of the existing mechanisms for caching.
> >>
> >>No, please reread this thread, especially this message: 
> >>http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of 
> >>the advantages of the kernel space implementation. The user space 
> >>implementation has to have data copied between the cache and user space 
> >>buffer, but the kernel space one can use pages in the cache directly, 
> >>without extra copy.
> > 
> > 
> > Well, you've said it thrice (the bellman cried) but that doesn't make it
> > true.
> > 
> > The way a user space solution should work is to schedule mmapped I/O
> > from the backing store and then send this mmapped region off for target
> > I/O.  For reads, the page gather will ensure that the pages are up to
> > date from the backing store to the cache before sending the I/O out.
> > For writes, You actually have to do a msync on the region to get the
> > data secured to the backing store. 
> 
> James, have you checked how fast is mmaped I/O if work size > size of 
> RAM? It's several times slower comparing to buffered I/O. It was many 
> times discussed in LKML and, seems, VM people consider it unavoidable. 

Erm, but if you're using the case of work size > size of RAM, you'll
find buffered I/O won't help because you don't have the memory for
buffers either.

> So, using mmaped IO isn't an option for high performance. Plus, mmaped 
> IO isn't an option for high reliability requirements, since it doesn't 
> provide a practical way to handle I/O errors.

I think you'll find it does ... the page gather returns -EFAULT if
there's an I/O error in the gathered region.  msync does something
similar if there's a write failure.

> > You also have to pull tricks with
> > the mmap region in the case of writes to prevent useless data being read
> > in from the backing store.
> 
> Can you be more exact and specify what kind of tricks should be done for 
> that?

Actually, just avoid touching it seems to do the trick with a recent
kernel.

James


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


James Bottomley wrote:

On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:


James Bottomley wrote:

So, James, what is your opinion on the above? Or the overall SCSI target 
project simplicity doesn't matter much for you and you think it's fine 
to duplicate Linux page cache in the user space to keep the in-kernel 
part of the project as small as possible?



The answers were pretty much contained here

http://marc.info/?l=linux-scsi&m=120164008302435

and here:

http://marc.info/?l=linux-scsi&m=120171067107293

Weren't they?


No, sorry, it doesn't look so for me. They are about performance, but 
I'm asking about the overall project's architecture, namely about one 
part of it: simplicity. Particularly, what do you think about 
duplicating Linux page cache in the user space to have zero-copy cached 
I/O? Or can you suggest another architectural solution for that problem 
in the STGT's approach?



Isn't that an advantage of a user space solution?  It simply uses the
backing store of whatever device supplies the data.  That means it takes
advantage of the existing mechanisms for caching.


No, please reread this thread, especially this message: 
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of 
the advantages of the kernel space implementation. The user space 
implementation has to have data copied between the cache and user space 
buffer, but the kernel space one can use pages in the cache directly, 
without extra copy.



Well, you've said it thrice (the bellman cried) but that doesn't make it
true.

The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.  For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, You actually have to do a msync on the region to get the
data secured to the backing store. 


James, have you checked how fast is mmaped I/O if work size > size of 
RAM? It's several times slower comparing to buffered I/O. It was many 
times discussed in LKML and, seems, VM people consider it unavoidable. 
So, using mmaped IO isn't an option for high performance. Plus, mmaped 
IO isn't an option for high reliability requirements, since it doesn't 
provide a practical way to handle I/O errors.



You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.


Can you be more exact and specify what kind of tricks should be done for 
that?



 However, none of this involves data copies.

James


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:
> James Bottomley wrote:
> So, James, what is your opinion on the above? Or the overall SCSI target 
> project simplicity doesn't matter much for you and you think it's fine 
> to duplicate Linux page cache in the user space to keep the in-kernel 
> part of the project as small as possible?
> >>>
> >>>
> >>>The answers were pretty much contained here
> >>>
> >>>http://marc.info/?l=linux-scsi&m=120164008302435
> >>>
> >>>and here:
> >>>
> >>>http://marc.info/?l=linux-scsi&m=120171067107293
> >>>
> >>>Weren't they?
> >>
> >>No, sorry, it doesn't look so for me. They are about performance, but 
> >>I'm asking about the overall project's architecture, namely about one 
> >>part of it: simplicity. Particularly, what do you think about 
> >>duplicating Linux page cache in the user space to have zero-copy cached 
> >>I/O? Or can you suggest another architectural solution for that problem 
> >>in the STGT's approach?
> > 
> > 
> > Isn't that an advantage of a user space solution?  It simply uses the
> > backing store of whatever device supplies the data.  That means it takes
> > advantage of the existing mechanisms for caching.
> 
> No, please reread this thread, especially this message: 
> http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of 
> the advantages of the kernel space implementation. The user space 
> implementation has to have data copied between the cache and user space 
> buffer, but the kernel space one can use pages in the cache directly, 
> without extra copy.

Well, you've said it thrice (the bellman cried) but that doesn't make it
true.

The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.  For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, You actually have to do a msync on the region to get the
data secured to the backing store.  You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.  However, none of this involves data copies.

James


-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


James Bottomley wrote:
So, James, what is your opinion on the above? Or the overall SCSI target 
project simplicity doesn't matter much for you and you think it's fine 
to duplicate Linux page cache in the user space to keep the in-kernel 
part of the project as small as possible?



The answers were pretty much contained here

http://marc.info/?l=linux-scsi&m=120164008302435

and here:

http://marc.info/?l=linux-scsi&m=120171067107293

Weren't they?


No, sorry, it doesn't look so for me. They are about performance, but 
I'm asking about the overall project's architecture, namely about one 
part of it: simplicity. Particularly, what do you think about 
duplicating Linux page cache in the user space to have zero-copy cached 
I/O? Or can you suggest another architectural solution for that problem 
in the STGT's approach?



Isn't that an advantage of a user space solution?  It simply uses the
backing store of whatever device supplies the data.  That means it takes
advantage of the existing mechanisms for caching.


No, please reread this thread, especially this message: 
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of 
the advantages of the kernel space implementation. The user space 
implementation has to have data copied between the cache and user space 
buffer, but the kernel space one can use pages in the cache directly, 
without extra copy.


Vlad
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel


Bart Van Assche wrote:

On Feb 4, 2008 1:27 PM, Vladislav Bolkhovitin <[EMAIL PROTECTED]> wrote:


So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?



It's too early to draw conclusions about performance. I'm currently
performing more measurements, and the results are not easy to
interpret. My plan is to measure the following:
* Setup: target with RAM disk of 2 GB as backing storage.
* Throughput reported by dd and xdd (direct I/O).
* Transfers with dd/xdd in units of 1 KB to 1 GB (the smallest
transfer size that can be specified to xdd is 1 KB).
* Target SCSI software to be tested: IETD iSCSI via IPoIB, STGT iSCSI
via IPoIB, STGT iSER, SCST iSCSI via IPoIB, SCST SRP, LIO iSCSI via
IPoIB.

The reason I chose dd/xdd for these tests is that I want to measure
the performance of the communication protocols, and that I am assuming
that this performance can be modeled by the following formula:
(transfer time in s) = (transfer setup latency in s) + (transfer size
in MB) / (bandwidth in MB/s).


It isn't fully correct, you forgot about link latency. More correct one is:

(transfer time) = (transfer setup latency on both initiator and target, 
consisting from software processing time, including memory copy, if 
necessary, and PCI setup/transfer time) + (transfer size)/(bandwidth) + 
(link latency to deliver request for READs or status for WRITES) + 
(2*(link latency) to deliver R2T/XFER_READY request in case of WRITEs, 
if necessary (e.g. iSER for small transfers might not need it, but SRP 
most likely always needs it)). Also you should note that it's correct 
only in case of single threaded workloads with one outstanding command 
at time. For other workloads it depends from how well they manage to 
keep the "link" full in interval from (transfer size)/(transfer time) to 
bandwidth.



Measuring the time needed for transfers
with varying block size allows to compute the constants in the above
formula via linear regression.


Unfortunately, it isn't so easy, see above.


One difficulty I already encountered is that the performance of the
Linux IPoIB implementation varies a lot under high load
(http://bugzilla.kernel.org/show_bug.cgi?id=9883).

Another issue I have to look further into is that dd and xdd report
different results for very large block sizes (> 1 MB).


Look at /proc/scsi_tgt/sgv (for SCST) and you will see, which transfer 
sizes are actually used. Initiators don't like sending big requests and 
often split them on smaller ones.


Look at this message as well, it might be helpful: 
http://lkml.org/lkml/2007/5/16/223



Bart Van Assche.
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html



-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Integration of SCST in the mainstream Linux kernel