Re: [PATCH 2.6.23.14] SCSI : scsi_device_lookup/scsi_device_lookup_by_target return NULL for an existent scsi_device

2008-01-28 Thread Nagendra Tomar
James,
  I am facing issues with device removal being done when there
are commands outstanding in the LLD. As explained in my original post,
its resulting in effects ranging from the duplicate kobject warnings to
the inability of the scsi subsystem to find a valid device (all symptoms
are related to the fact that two scsi devices with the same HBTL but in 
diff states, SDEV_DEL and SDEV_RUNNING, are in the lists).
  After extensive audit of the code I'm quite certain that the 
aforementioned
patch is the right thing to do. I'll greatly appreciate if you can highlight
reason(s) why that might not be a good idea. Pls apply o/w.

Thanx,
Tomar

--- Nagendra Tomar <[EMAIL PROTECTED]> wrote:

> Hello James,
>  My understanding is that the scsi_device in SDEV_DEL state
> is there in the scsi_host->devices/scsi_target->devices queue, just
> because there is some outstanding command holding a reference to it.
> 
> It will need the device when it completes. Apart from this, for all 
> practical purposes the scsi_device is gone from the system. It could
> as well be removed from scsi_host->devices/scsi_target->devices lists
> and be put in some other list, just to hold the scsi_device till
> commands refering to it are completed. 
> 
> The scanning code should not consider these devices to be present 
> in the system. This is correctly handled today, as 
> scsi_device_lookup/scsi_device_lookup_by_target return NULL if there
> is an SDEV_DEL device in the list. 
> 
> Since a scsi_device in SDEV_DEL state is "gone" from the system we 
> should not hold any fresh references on to this device. The current
> scsi_device_lookup/scsi_device_lookup_by_target implementation rightly
> ensures that. And since all the sysfs linkages are also removed (in
> __scsi_remove_device->device_del), user space can also not reference 
> this device. All is good till now.
> 
> The problem happens when we try to add a new scsi_device with the same 
> HBTL. Since we consider the device "gone" (as described above) we should
> allow a new scsi_device with the same HBTL to be added (to the 
> scsi_host->devices/scsi_target->devices list). This part is also correctly
> implemented.
> scsi_add_device->...->scsi_probe_and_add_lun->scsi_device_lookup_by_target
> will _not_ return the device present in the SDEV_DEL state and hence the
> scanning will go ahead and try to add a new scsi_device (this one in 
> SDEV_RUNNING state) to the devices lists.
> 
> All is good even till now.
> 
> The PROBLEM is, now any scsi_device_lookup call trying to lookup this newly
> added scsi_device, fails. This is because, 
> scsi_device_lookup->__scsi_device_lookup
> returns the first device that it finds in the list, which in this case
> is the one in the SDEV_DEL state. Now the scsi_device_get call that 
> scsi_device_lookup makes to get a reference on that device returns ENXIO
> as the device is in SDEV_DEL state, resulting in scsi_device_lookup to 
> return NULL.
> 
> What scsi_device_get does, is right, as we do not want to hold fresh 
> references on scsi_devices in SDEV_DEL state. The problem is because
> of this we fail to lookup the perfectly legitimate device (in SDEV_RUNNING
> state) with the same HBTL sitting in the list.
> 
> One of the side effects of this is that the scsi_probe_and_add_lun()
> goes ahead with the scanning and tries to add this existent device. 
> This is the real problem.
> 
> My patch avoids this problem by not breaking from the __scsi_device_lookup
> loop, if the device is in SDEV_DEL state. After all we should not consider
> these devices to be part of the system. This will allow us to 
> find the right scsi_device and this "trying to add an existent device"
> problem will be avoided. 
> 
> And also why should scsi_device_lookup and __scsi_device_lookup be 
> different in behaviour. One returns devices in SDEV_DEL state, the other
> doesn't. The comments suggest that they can be used interchangibly, but
> for the locking and the extra reference that the scsi_device_lookup holds.
> 
> This is fixed as a side effect of the patch.
> 
> Comments welcome.
> 
> Thanx,
> Tomar
> 
> 
> 
> 
> 
> --- James Smart <[EMAIL PROTECTED]> wrote:
> 
> > 
> > This sounds like a return to the old behavior, where sdevs in SDEV_DEL
> > were ignored. However, it too had lots of bad effects. We'd have to go
> > back to the threads over the last 2 years that justified resurrecting
> > the sdev. Start looking at threads like :
> > http://marc.info/?l=linux-scsi=11788730468=2
> > http://marc.info/?l=linux-scsi=116837744314913=2
> > http://marc.info/?l=linux-scsi=117139230702785=2
>

Re: [PATCH 2.6.23.14] SCSI : scsi_device_lookup/scsi_device_lookup_by_target return NULL for an existent scsi_device

2008-01-28 Thread Nagendra Tomar
James,
  I am facing issues with device removal being done when there
are commands outstanding in the LLD. As explained in my original post,
its resulting in effects ranging from the duplicate kobject warnings to
the inability of the scsi subsystem to find a valid device (all symptoms
are related to the fact that two scsi devices with the same HBTL but in 
diff states, SDEV_DEL and SDEV_RUNNING, are in the lists).
  After extensive audit of the code I'm quite certain that the 
aforementioned
patch is the right thing to do. I'll greatly appreciate if you can highlight
reason(s) why that might not be a good idea. Pls apply o/w.

Thanx,
Tomar

--- Nagendra Tomar [EMAIL PROTECTED] wrote:

 Hello James,
  My understanding is that the scsi_device in SDEV_DEL state
 is there in the scsi_host-devices/scsi_target-devices queue, just
 because there is some outstanding command holding a reference to it.
 
 It will need the device when it completes. Apart from this, for all 
 practical purposes the scsi_device is gone from the system. It could
 as well be removed from scsi_host-devices/scsi_target-devices lists
 and be put in some other list, just to hold the scsi_device till
 commands refering to it are completed. 
 
 The scanning code should not consider these devices to be present 
 in the system. This is correctly handled today, as 
 scsi_device_lookup/scsi_device_lookup_by_target return NULL if there
 is an SDEV_DEL device in the list. 
 
 Since a scsi_device in SDEV_DEL state is gone from the system we 
 should not hold any fresh references on to this device. The current
 scsi_device_lookup/scsi_device_lookup_by_target implementation rightly
 ensures that. And since all the sysfs linkages are also removed (in
 __scsi_remove_device-device_del), user space can also not reference 
 this device. All is good till now.
 
 The problem happens when we try to add a new scsi_device with the same 
 HBTL. Since we consider the device gone (as described above) we should
 allow a new scsi_device with the same HBTL to be added (to the 
 scsi_host-devices/scsi_target-devices list). This part is also correctly
 implemented.
 scsi_add_device-...-scsi_probe_and_add_lun-scsi_device_lookup_by_target
 will _not_ return the device present in the SDEV_DEL state and hence the
 scanning will go ahead and try to add a new scsi_device (this one in 
 SDEV_RUNNING state) to the devices lists.
 
 All is good even till now.
 
 The PROBLEM is, now any scsi_device_lookup call trying to lookup this newly
 added scsi_device, fails. This is because, 
 scsi_device_lookup-__scsi_device_lookup
 returns the first device that it finds in the list, which in this case
 is the one in the SDEV_DEL state. Now the scsi_device_get call that 
 scsi_device_lookup makes to get a reference on that device returns ENXIO
 as the device is in SDEV_DEL state, resulting in scsi_device_lookup to 
 return NULL.
 
 What scsi_device_get does, is right, as we do not want to hold fresh 
 references on scsi_devices in SDEV_DEL state. The problem is because
 of this we fail to lookup the perfectly legitimate device (in SDEV_RUNNING
 state) with the same HBTL sitting in the list.
 
 One of the side effects of this is that the scsi_probe_and_add_lun()
 goes ahead with the scanning and tries to add this existent device. 
 This is the real problem.
 
 My patch avoids this problem by not breaking from the __scsi_device_lookup
 loop, if the device is in SDEV_DEL state. After all we should not consider
 these devices to be part of the system. This will allow us to 
 find the right scsi_device and this trying to add an existent device
 problem will be avoided. 
 
 And also why should scsi_device_lookup and __scsi_device_lookup be 
 different in behaviour. One returns devices in SDEV_DEL state, the other
 doesn't. The comments suggest that they can be used interchangibly, but
 for the locking and the extra reference that the scsi_device_lookup holds.
 
 This is fixed as a side effect of the patch.
 
 Comments welcome.
 
 Thanx,
 Tomar
 
 
 
 
 
 --- James Smart [EMAIL PROTECTED] wrote:
 
  
  This sounds like a return to the old behavior, where sdevs in SDEV_DEL
  were ignored. However, it too had lots of bad effects. We'd have to go
  back to the threads over the last 2 years that justified resurrecting
  the sdev. Start looking at threads like :
  http://marc.info/?l=linux-scsim=11788730468w=2
  http://marc.info/?l=linux-scsim=116837744314913w=2
  http://marc.info/?l=linux-scsim=117139230702785w=2
  http://marc.info/?l=linux-scsim=117991046126294w=2
  Also, there's multiple parts to this - the sdev struct, and the sysfs 
  objects
  and thus namespace associated with the struct, etc.
  
  So, in my mind, if this reverts to ignoring sdevs in SDEV_DEL, and creates
  a duplicate sdev in SDEV_RUNNING, then it's the wrong patch.  What should
  be considered is where did the resurrection of the sdev go wrong.  I
  remember that Hannes did some updates
  http

Re: [PATCH 2.6.23.14] SCSI : scsi_device_lookup/scsi_device_lookup_by_target return NULL for an existent scsi_device

2008-01-23 Thread Nagendra Tomar
Hello James,
 My understanding is that the scsi_device in SDEV_DEL state
is there in the scsi_host->devices/scsi_target->devices queue, just
because there is some outstanding command holding a reference to it.

It will need the device when it completes. Apart from this, for all 
practical purposes the scsi_device is gone from the system. It could
as well be removed from scsi_host->devices/scsi_target->devices lists
and be put in some other list, just to hold the scsi_device till
commands refering to it are completed. 

The scanning code should not consider these devices to be present 
in the system. This is correctly handled today, as 
scsi_device_lookup/scsi_device_lookup_by_target return NULL if there
is an SDEV_DEL device in the list. 

Since a scsi_device in SDEV_DEL state is "gone" from the system we 
should not hold any fresh references on to this device. The current
scsi_device_lookup/scsi_device_lookup_by_target implementation rightly
ensures that. And since all the sysfs linkages are also removed (in
__scsi_remove_device->device_del), user space can also not reference 
this device. All is good till now.

The problem happens when we try to add a new scsi_device with the same 
HBTL. Since we consider the device "gone" (as described above) we should
allow a new scsi_device with the same HBTL to be added (to the 
scsi_host->devices/scsi_target->devices list). This part is also correctly
implemented.
scsi_add_device->...->scsi_probe_and_add_lun->scsi_device_lookup_by_target
will _not_ return the device present in the SDEV_DEL state and hence the
scanning will go ahead and try to add a new scsi_device (this one in 
SDEV_RUNNING state) to the devices lists.

All is good even till now.

The PROBLEM is, now any scsi_device_lookup call trying to lookup this newly
added scsi_device, fails. This is because, 
scsi_device_lookup->__scsi_device_lookup
returns the first device that it finds in the list, which in this case
is the one in the SDEV_DEL state. Now the scsi_device_get call that 
scsi_device_lookup makes to get a reference on that device returns ENXIO
as the device is in SDEV_DEL state, resulting in scsi_device_lookup to 
return NULL.

What scsi_device_get does, is right, as we do not want to hold fresh 
references on scsi_devices in SDEV_DEL state. The problem is because
of this we fail to lookup the perfectly legitimate device (in SDEV_RUNNING
state) with the same HBTL sitting in the list.

One of the side effects of this is that the scsi_probe_and_add_lun()
goes ahead with the scanning and tries to add this existent device. 
This is the real problem.

My patch avoids this problem by not breaking from the __scsi_device_lookup
loop, if the device is in SDEV_DEL state. After all we should not consider
these devices to be part of the system. This will allow us to 
find the right scsi_device and this "trying to add an existent device"
problem will be avoided. 

And also why should scsi_device_lookup and __scsi_device_lookup be 
different in behaviour. One returns devices in SDEV_DEL state, the other
doesn't. The comments suggest that they can be used interchangibly, but
for the locking and the extra reference that the scsi_device_lookup holds.

This is fixed as a side effect of the patch.

Comments welcome.

Thanx,
Tomar





--- James Smart <[EMAIL PROTECTED]> wrote:

> 
> This sounds like a return to the old behavior, where sdevs in SDEV_DEL
> were ignored. However, it too had lots of bad effects. We'd have to go
> back to the threads over the last 2 years that justified resurrecting
> the sdev. Start looking at threads like :
> http://marc.info/?l=linux-scsi=11788730468=2
> http://marc.info/?l=linux-scsi=116837744314913=2
> http://marc.info/?l=linux-scsi=117139230702785=2
> http://marc.info/?l=linux-scsi=117991046126294=2
> Also, there's multiple parts to this - the sdev struct, and the sysfs objects
> and thus namespace associated with the struct, etc.
> 
> So, in my mind, if this reverts to ignoring sdevs in SDEV_DEL, and creates
> a duplicate sdev in SDEV_RUNNING, then it's the wrong patch.  What should
> be considered is where did the resurrection of the sdev go wrong.  I
> remember that Hannes did some updates
> http://marc.info/?l=linux-scsi=118215727101887=2
> but I don't believe these ever got merged upstream. Perhaps that's a good
> place to start.
> 
> -- james s
> 
> 
> Nagendra Tomar wrote:
> > __scsi_device_lookup and __scsi_device_lookup_by_target do not 
> > check for the sdev_state and hence return scsi_devices with 
> > sdev_state set to SDEV_DEL also. It has the following side effects.
> > 
> > We can have two scsi_devices with the same HBTL queued in 
> > the scsi_host->__devices/scsi_target->devices list, one
> > in the SDEV_DEL state and the other in, say SDEV_RUNNING state

[PATCH 2.6.23.14] SCSI : scsi_device_lookup/scsi_device_lookup_by_target return NULL for an existent scsi_device

2008-01-23 Thread Nagendra Tomar

__scsi_device_lookup and __scsi_device_lookup_by_target do not 
check for the sdev_state and hence return scsi_devices with 
sdev_state set to SDEV_DEL also. It has the following side effects.

We can have two scsi_devices with the same HBTL queued in 
the scsi_host->__devices/scsi_target->devices list, one
in the SDEV_DEL state and the other in, say SDEV_RUNNING state. 
If the one in the SDEV_DEL state is before the one in SDEV_RUNNING 
state, (which will almost always be the case) the scsi_device_lookup and 
scsi_device_lookup_by_target will never find the totally legitimate
scsi_device (the one in the SDEV_RUNNING state).

This is because __scsi_device_lookup/__scsi_device_lookup_by_target 
always returns the first one in the list (which in our case is the 
one with the SDEV_DEL state) and the scsi_device_get() which is called by 
scsi_device_lookup/scsi_device_lookup_by_target will return -ENXIO 
for this scsi_device, resulting in scsi_device_lookup and 
scsi_device_lookup_by_target to return NULL.

So we _cannot_ lookup a perfectly valid device present in the
list of scsi_devices. 

The right thing to do is to not have __scsi_device_lookup
and __scsi_device_lookup_by_target match a device if the scsi_device
state is SDEV_DEL. This will also make these functions similar in 
behaviour to their scsi_device_lookup/scsi_device_lookup_by_target
counterparts, as the comments in the code suggest.

One way by which we can have two scsi_devices in the list is 
as follows.
Suppose a scsi_device has some outstanding command(s) when 
scsi_remove_device is called for it. Due to the extra ref being held
by the command in flight, the __scsi_remove_device->put_device call 
will not actually free the scsi_device and it will remain in the 
scsi_device list albeit in the SDEV_DEL state. Now if we do a 
scsi_add_device for the same HBTL, a new device with the same HBTL
(this one in SDEV_RUNNING state) gets added to the scsi_device list. 

Infact if we call scsi_add_device one more time, it happily 
goes ahead and tries to add it once more, as 
scsi_probe_and_add_lun->scsi_device_lookup_by_target does not return
the already existing device. This will though result in the kobject 
EEXIST warning dump.

The patch below solves the problem described here by not
returning scsi_devices in SDEV_DEL state, thus allowing scsi_device
in SDEV_RUNNING state (if any) to be correctly returned, instead.


Thanx,
Tomar


Signed-off-by: Nagendra Singh Tomar <[EMAIL PROTECTED]>
---

--- linux-2.6.23.14/drivers/scsi/scsi.c.orig2008-01-23 18:06:02.0 
+0530
+++ linux-2.6.23.14/drivers/scsi/scsi.c 2008-01-23 19:17:35.0 +0530
@@ -951,7 +951,7 @@ struct scsi_device *__scsi_device_lookup
struct scsi_device *sdev;
 
list_for_each_entry(sdev, >devices, same_target_siblings) {
-   if (sdev->lun ==lun)
+   if (sdev->lun == lun && sdev->sdev_state != SDEV_DEL)
return sdev;
}
 
@@ -1008,7 +1008,7 @@ struct scsi_device *__scsi_device_lookup
 
list_for_each_entry(sdev, >__devices, siblings) {
if (sdev->channel == channel && sdev->id == id &&
-   sdev->lun ==lun)
+   sdev->lun == lun && sdev->sdev_state != SDEV_DEL)
return sdev;
}



  ___
Support the World Aids Awareness campaign this month with Yahoo! For Good 
http://uk.promotions.yahoo.com/forgood/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.23.14] SCSI : scsi_device_lookup/scsi_device_lookup_by_target return NULL for an existent scsi_device

2008-01-23 Thread Nagendra Tomar

__scsi_device_lookup and __scsi_device_lookup_by_target do not 
check for the sdev_state and hence return scsi_devices with 
sdev_state set to SDEV_DEL also. It has the following side effects.

We can have two scsi_devices with the same HBTL queued in 
the scsi_host-__devices/scsi_target-devices list, one
in the SDEV_DEL state and the other in, say SDEV_RUNNING state. 
If the one in the SDEV_DEL state is before the one in SDEV_RUNNING 
state, (which will almost always be the case) the scsi_device_lookup and 
scsi_device_lookup_by_target will never find the totally legitimate
scsi_device (the one in the SDEV_RUNNING state).

This is because __scsi_device_lookup/__scsi_device_lookup_by_target 
always returns the first one in the list (which in our case is the 
one with the SDEV_DEL state) and the scsi_device_get() which is called by 
scsi_device_lookup/scsi_device_lookup_by_target will return -ENXIO 
for this scsi_device, resulting in scsi_device_lookup and 
scsi_device_lookup_by_target to return NULL.

So we _cannot_ lookup a perfectly valid device present in the
list of scsi_devices. 

The right thing to do is to not have __scsi_device_lookup
and __scsi_device_lookup_by_target match a device if the scsi_device
state is SDEV_DEL. This will also make these functions similar in 
behaviour to their scsi_device_lookup/scsi_device_lookup_by_target
counterparts, as the comments in the code suggest.

One way by which we can have two scsi_devices in the list is 
as follows.
Suppose a scsi_device has some outstanding command(s) when 
scsi_remove_device is called for it. Due to the extra ref being held
by the command in flight, the __scsi_remove_device-put_device call 
will not actually free the scsi_device and it will remain in the 
scsi_device list albeit in the SDEV_DEL state. Now if we do a 
scsi_add_device for the same HBTL, a new device with the same HBTL
(this one in SDEV_RUNNING state) gets added to the scsi_device list. 

Infact if we call scsi_add_device one more time, it happily 
goes ahead and tries to add it once more, as 
scsi_probe_and_add_lun-scsi_device_lookup_by_target does not return
the already existing device. This will though result in the kobject 
EEXIST warning dump.

The patch below solves the problem described here by not
returning scsi_devices in SDEV_DEL state, thus allowing scsi_device
in SDEV_RUNNING state (if any) to be correctly returned, instead.


Thanx,
Tomar


Signed-off-by: Nagendra Singh Tomar [EMAIL PROTECTED]
---

--- linux-2.6.23.14/drivers/scsi/scsi.c.orig2008-01-23 18:06:02.0 
+0530
+++ linux-2.6.23.14/drivers/scsi/scsi.c 2008-01-23 19:17:35.0 +0530
@@ -951,7 +951,7 @@ struct scsi_device *__scsi_device_lookup
struct scsi_device *sdev;
 
list_for_each_entry(sdev, starget-devices, same_target_siblings) {
-   if (sdev-lun ==lun)
+   if (sdev-lun == lun  sdev-sdev_state != SDEV_DEL)
return sdev;
}
 
@@ -1008,7 +1008,7 @@ struct scsi_device *__scsi_device_lookup
 
list_for_each_entry(sdev, shost-__devices, siblings) {
if (sdev-channel == channel  sdev-id == id 
-   sdev-lun ==lun)
+   sdev-lun == lun  sdev-sdev_state != SDEV_DEL)
return sdev;
}



  ___
Support the World Aids Awareness campaign this month with Yahoo! For Good 
http://uk.promotions.yahoo.com/forgood/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23.14] SCSI : scsi_device_lookup/scsi_device_lookup_by_target return NULL for an existent scsi_device

2008-01-23 Thread Nagendra Tomar
Hello James,
 My understanding is that the scsi_device in SDEV_DEL state
is there in the scsi_host-devices/scsi_target-devices queue, just
because there is some outstanding command holding a reference to it.

It will need the device when it completes. Apart from this, for all 
practical purposes the scsi_device is gone from the system. It could
as well be removed from scsi_host-devices/scsi_target-devices lists
and be put in some other list, just to hold the scsi_device till
commands refering to it are completed. 

The scanning code should not consider these devices to be present 
in the system. This is correctly handled today, as 
scsi_device_lookup/scsi_device_lookup_by_target return NULL if there
is an SDEV_DEL device in the list. 

Since a scsi_device in SDEV_DEL state is gone from the system we 
should not hold any fresh references on to this device. The current
scsi_device_lookup/scsi_device_lookup_by_target implementation rightly
ensures that. And since all the sysfs linkages are also removed (in
__scsi_remove_device-device_del), user space can also not reference 
this device. All is good till now.

The problem happens when we try to add a new scsi_device with the same 
HBTL. Since we consider the device gone (as described above) we should
allow a new scsi_device with the same HBTL to be added (to the 
scsi_host-devices/scsi_target-devices list). This part is also correctly
implemented.
scsi_add_device-...-scsi_probe_and_add_lun-scsi_device_lookup_by_target
will _not_ return the device present in the SDEV_DEL state and hence the
scanning will go ahead and try to add a new scsi_device (this one in 
SDEV_RUNNING state) to the devices lists.

All is good even till now.

The PROBLEM is, now any scsi_device_lookup call trying to lookup this newly
added scsi_device, fails. This is because, 
scsi_device_lookup-__scsi_device_lookup
returns the first device that it finds in the list, which in this case
is the one in the SDEV_DEL state. Now the scsi_device_get call that 
scsi_device_lookup makes to get a reference on that device returns ENXIO
as the device is in SDEV_DEL state, resulting in scsi_device_lookup to 
return NULL.

What scsi_device_get does, is right, as we do not want to hold fresh 
references on scsi_devices in SDEV_DEL state. The problem is because
of this we fail to lookup the perfectly legitimate device (in SDEV_RUNNING
state) with the same HBTL sitting in the list.

One of the side effects of this is that the scsi_probe_and_add_lun()
goes ahead with the scanning and tries to add this existent device. 
This is the real problem.

My patch avoids this problem by not breaking from the __scsi_device_lookup
loop, if the device is in SDEV_DEL state. After all we should not consider
these devices to be part of the system. This will allow us to 
find the right scsi_device and this trying to add an existent device
problem will be avoided. 

And also why should scsi_device_lookup and __scsi_device_lookup be 
different in behaviour. One returns devices in SDEV_DEL state, the other
doesn't. The comments suggest that they can be used interchangibly, but
for the locking and the extra reference that the scsi_device_lookup holds.

This is fixed as a side effect of the patch.

Comments welcome.

Thanx,
Tomar





--- James Smart [EMAIL PROTECTED] wrote:

 
 This sounds like a return to the old behavior, where sdevs in SDEV_DEL
 were ignored. However, it too had lots of bad effects. We'd have to go
 back to the threads over the last 2 years that justified resurrecting
 the sdev. Start looking at threads like :
 http://marc.info/?l=linux-scsim=11788730468w=2
 http://marc.info/?l=linux-scsim=116837744314913w=2
 http://marc.info/?l=linux-scsim=117139230702785w=2
 http://marc.info/?l=linux-scsim=117991046126294w=2
 Also, there's multiple parts to this - the sdev struct, and the sysfs objects
 and thus namespace associated with the struct, etc.
 
 So, in my mind, if this reverts to ignoring sdevs in SDEV_DEL, and creates
 a duplicate sdev in SDEV_RUNNING, then it's the wrong patch.  What should
 be considered is where did the resurrection of the sdev go wrong.  I
 remember that Hannes did some updates
 http://marc.info/?l=linux-scsim=118215727101887w=2
 but I don't believe these ever got merged upstream. Perhaps that's a good
 place to start.
 
 -- james s
 
 
 Nagendra Tomar wrote:
  __scsi_device_lookup and __scsi_device_lookup_by_target do not 
  check for the sdev_state and hence return scsi_devices with 
  sdev_state set to SDEV_DEL also. It has the following side effects.
  
  We can have two scsi_devices with the same HBTL queued in 
  the scsi_host-__devices/scsi_target-devices list, one
  in the SDEV_DEL state and the other in, say SDEV_RUNNING state. 
  If the one in the SDEV_DEL state is before the one in SDEV_RUNNING 
  state, (which will almost always be the case) the scsi_device_lookup and 
  scsi_device_lookup_by_target will never find the totally legitimate
  scsi_device

Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Nagendra Tomar

--- Davide Libenzi <[EMAIL PROTECTED]> wrote:


> That's not what POLLOUT means in the Unix meaning. POLLOUT indicates the 
> ability to write, and it is not meant as to signal every time a packet 
> (skb) is sent on the wire (and the buffer released).

Aren't they both the same ? Everytime an incoming ACK frees up a buffer
from the retransmit queue, the writability condition is freshly asserted,
much the same way as the readability condition is asserted everytime a 
new data is queued in the socket receive queue (irrespective of 
whether there was data already waiting to be read in the receive queue).

This difference in meaning of POLLOUT only arises in the ET case, which was
not what traditional Unix poll referred to. 

Since its a new game the rules can be modified (ofcourse based on the 
merits i.e. usability)

Thanx,
Tomar





  ___ 
Want ideas for reducing your carbon footprint? Visit Yahoo! For Good  
http://uk.promotions.yahoo.com/forgood/environment.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Nagendra Tomar

--- Davide Libenzi <[EMAIL PROTECTED]> wrote:

> 
> Unfortunately f_op->poll() does not let the caller to specify the events 
> it's interested in, that would allow to split send/recevie wait queues and 
> better detect read/write cases.
> The detection of a waitqueue_active(->sk_wr_sleep) would work fine in 
> detecting is someone is actually waiting for a write, w/out the false 
> positives triggered by the read-waiters.
> That would be a very sane thing to do, but would require a big change 
> to all the ->poll around (that could be automated by a script - devices 
> not caring about the events hint can just continue to use the single queue 
> like they currently do), and a more critical and gradual change of all the 
> devices that wants to take advantage of it.
> That way, no more magic bits are needed, and a simple waitqueue_active() 
> would tell you if someone is waiting for write-space events.
> 

I like this.





  ___ 
To help you stay safe and secure online, we've developed the all new Yahoo! 
Security Centre. http://uk.security.yahoo.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Nagendra Tomar

--- Davide Libenzi <[EMAIL PROTECTED]> wrote:

> Looking back at it, I think the current TCP code is right, once you look 
> at the "event" to be a output buffer full->with_space transition.
> If you drop an fd inside epoll with EPOLLOUT|EPOLLET and you get an event 
> (free space on the output buffer), if you do not consume it (say a 
> tcp_sendmsg that re-fill the buffer), you can't see other OUT event 
> anymore since they happen on the full->with_space transition.
> Yes, I know, the read size (EPOLLIN) works differently and you get an 
> event for every packet you receive. And yes, I do not like asymmetric 
> things. But that does not make the EPOLLOUT|EPOLLET wrong IMO.
> 

I agree that ET means the event should happen at the transition
from nospace->space condition, but isn't the other case (event is 
delivered every time the event actually happens) more usable.
Also the epoll man page says so

"... Edge Triggered event distribution delivers events only when 
events happens on the  monitored file."

This serves the purpose of ET (reducing the number of poll events) and
at the same time makes userspace coding easier. My userspace code
has the liberty of deciding when it can write to the socket. f.e. the
sendfile buffer management example that I quoted in my earlier post
will be difficult with the current ET|POLLOUT behaviour. I cannot 
write in full-buffer units. I'll ve to write partial buffers just to 
fill the TCP writeq which is needed to trigger the event.

Thanx,
Tomar



  ___
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/ 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Nagendra Tomar

--- Eric Dumazet <[EMAIL PROTECTED]> wrote:

> Nagendra Tomar a écrit :
> > --- Davide Libenzi <[EMAIL PROTECTED]> wrote:
> > 
> >> On Wed, 19 Sep 2007, David Miller wrote:
> >>
> >>> From: Nagendra Tomar <[EMAIL PROTECTED]>
> >>> Date: Wed, 19 Sep 2007 15:37:09 -0700 (PDT)
> >>>
> >>>> With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call 
> >>>> will 
> >>>> not return, even when the incoming acks free the buffers.
> >>>> Note that this patch assumes that the SOCK_NOSPACE check in
> >>>> tcp_check_space is a trivial optimization which can be safely removed.
> >>> I already replied to your patch posting explaining that whatever is
> >>> not setting SOCK_NOSPACE should be fixed instead.
> >>>
> >>> Please address that, thanks.
> >> You're not planning of putting the notion of a SOCK_NOSPACE bit inside a 
> >> completely device-unaware interface like epoll, I hope?
> >>
> > 
> > Definitely not ! 
> > 
> > The point is that the "tcp write space available" 
> > wakeup does not get called if SOCK_NOSPACE bit is not set. This was
> > fine when the wakeup was merely a wakeup (since SOCK_NOSPACE bit 
> > indicated that someone really cared abt the wakeup). Now after the
> > introduction of callback'ed wakeups, we might have some work to
> > do inside the callback even if there is nobody interested in the wakeup
> > at that point of time. 
> > 
> > In this particular case the ep_poll_callback is not getting called and
> > hence the socket fd is not getting added to the ready list.
> > 
> 
> Does it means that with your patch each ACK on a ET managed socket will 
> trigger an epoll event   ?
> 
> Maybe your very sensitive high throuput appication needs to set a flag or 
> something at socket level to ask for such a behavior.
> 
> The default should stay as is. That is an event should be sent only if 
> someone 
> cared about the wakeup.
> 

A high throughput app will always care about the wakeup, or else it will 
not be a high throughput app in the first place. An application that
occasionaly writes and then goes to slumber and then writes again will
not be a high throughput app. 

My point is that the SOCK_NOSPACE check does not save us much. For
high throughput app it will almost always be set, thus making the 
check insignificant, and for the low throughput case we care less.

Thanx,
Tomar



  ___
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/ 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Nagendra Tomar

--- Eric Dumazet [EMAIL PROTECTED] wrote:

 Nagendra Tomar a écrit :
  --- Davide Libenzi [EMAIL PROTECTED] wrote:
  
  On Wed, 19 Sep 2007, David Miller wrote:
 
  From: Nagendra Tomar [EMAIL PROTECTED]
  Date: Wed, 19 Sep 2007 15:37:09 -0700 (PDT)
 
  With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call 
  will 
  not return, even when the incoming acks free the buffers.
  Note that this patch assumes that the SOCK_NOSPACE check in
  tcp_check_space is a trivial optimization which can be safely removed.
  I already replied to your patch posting explaining that whatever is
  not setting SOCK_NOSPACE should be fixed instead.
 
  Please address that, thanks.
  You're not planning of putting the notion of a SOCK_NOSPACE bit inside a 
  completely device-unaware interface like epoll, I hope?
 
  
  Definitely not ! 
  
  The point is that the tcp write space available 
  wakeup does not get called if SOCK_NOSPACE bit is not set. This was
  fine when the wakeup was merely a wakeup (since SOCK_NOSPACE bit 
  indicated that someone really cared abt the wakeup). Now after the
  introduction of callback'ed wakeups, we might have some work to
  do inside the callback even if there is nobody interested in the wakeup
  at that point of time. 
  
  In this particular case the ep_poll_callback is not getting called and
  hence the socket fd is not getting added to the ready list.
  
 
 Does it means that with your patch each ACK on a ET managed socket will 
 trigger an epoll event   ?
 
 Maybe your very sensitive high throuput appication needs to set a flag or 
 something at socket level to ask for such a behavior.
 
 The default should stay as is. That is an event should be sent only if 
 someone 
 cared about the wakeup.
 

A high throughput app will always care about the wakeup, or else it will 
not be a high throughput app in the first place. An application that
occasionaly writes and then goes to slumber and then writes again will
not be a high throughput app. 

My point is that the SOCK_NOSPACE check does not save us much. For
high throughput app it will almost always be set, thus making the 
check insignificant, and for the low throughput case we care less.

Thanx,
Tomar



  ___
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/ 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Nagendra Tomar

--- Davide Libenzi [EMAIL PROTECTED] wrote:

 Looking back at it, I think the current TCP code is right, once you look 
 at the event to be a output buffer full-with_space transition.
 If you drop an fd inside epoll with EPOLLOUT|EPOLLET and you get an event 
 (free space on the output buffer), if you do not consume it (say a 
 tcp_sendmsg that re-fill the buffer), you can't see other OUT event 
 anymore since they happen on the full-with_space transition.
 Yes, I know, the read size (EPOLLIN) works differently and you get an 
 event for every packet you receive. And yes, I do not like asymmetric 
 things. But that does not make the EPOLLOUT|EPOLLET wrong IMO.
 

I agree that ET means the event should happen at the transition
from nospace-space condition, but isn't the other case (event is 
delivered every time the event actually happens) more usable.
Also the epoll man page says so

... Edge Triggered event distribution delivers events only when 
events happens on the  monitored file.

This serves the purpose of ET (reducing the number of poll events) and
at the same time makes userspace coding easier. My userspace code
has the liberty of deciding when it can write to the socket. f.e. the
sendfile buffer management example that I quoted in my earlier post
will be difficult with the current ET|POLLOUT behaviour. I cannot 
write in full-buffer units. I'll ve to write partial buffers just to 
fill the TCP writeq which is needed to trigger the event.

Thanx,
Tomar



  ___
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/ 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Nagendra Tomar

--- Davide Libenzi [EMAIL PROTECTED] wrote:

 
 Unfortunately f_op-poll() does not let the caller to specify the events 
 it's interested in, that would allow to split send/recevie wait queues and 
 better detect read/write cases.
 The detection of a waitqueue_active(-sk_wr_sleep) would work fine in 
 detecting is someone is actually waiting for a write, w/out the false 
 positives triggered by the read-waiters.
 That would be a very sane thing to do, but would require a bigdumb change 
 to all the -poll around (that could be automated by a script - devices 
 not caring about the events hint can just continue to use the single queue 
 like they currently do), and a more critical and gradual change of all the 
 devices that wants to take advantage of it.
 That way, no more magic bits are needed, and a simple waitqueue_active() 
 would tell you if someone is waiting for write-space events.
 

I like this.





  ___ 
To help you stay safe and secure online, we've developed the all new Yahoo! 
Security Centre. http://uk.security.yahoo.com
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-20 Thread Nagendra Tomar

--- Davide Libenzi [EMAIL PROTECTED] wrote:


 That's not what POLLOUT means in the Unix meaning. POLLOUT indicates the 
 ability to write, and it is not meant as to signal every time a packet 
 (skb) is sent on the wire (and the buffer released).

Aren't they both the same ? Everytime an incoming ACK frees up a buffer
from the retransmit queue, the writability condition is freshly asserted,
much the same way as the readability condition is asserted everytime a 
new data is queued in the socket receive queue (irrespective of 
whether there was data already waiting to be read in the receive queue).

This difference in meaning of POLLOUT only arises in the ET case, which was
not what traditional Unix poll referred to. 

Since its a new game the rules can be modified (ofcourse based on the 
merits i.e. usability)

Thanx,
Tomar





  ___ 
Want ideas for reducing your carbon footprint? Visit Yahoo! For Good  
http://uk.promotions.yahoo.com/forgood/environment.html
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Nagendra Tomar

--- Davide Libenzi <[EMAIL PROTECTED]> wrote:

> On Wed, 19 Sep 2007, David Miller wrote:
> 
> > From: Nagendra Tomar <[EMAIL PROTECTED]>
> > Date: Wed, 19 Sep 2007 15:37:09 -0700 (PDT)
> > 
> > > With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call 
> > > will 
> > > not return, even when the incoming acks free the buffers.
> > > Note that this patch assumes that the SOCK_NOSPACE check in
> > > tcp_check_space is a trivial optimization which can be safely removed.
> > 
> > I already replied to your patch posting explaining that whatever is
> > not setting SOCK_NOSPACE should be fixed instead.
> > 
> > Please address that, thanks.
> 
> You're not planning of putting the notion of a SOCK_NOSPACE bit inside a 
> completely device-unaware interface like epoll, I hope?
> 

Definitely not ! 

The point is that the "tcp write space available" 
wakeup does not get called if SOCK_NOSPACE bit is not set. This was
fine when the wakeup was merely a wakeup (since SOCK_NOSPACE bit 
indicated that someone really cared abt the wakeup). Now after the
introduction of callback'ed wakeups, we might have some work to
do inside the callback even if there is nobody interested in the wakeup
at that point of time. 

In this particular case the ep_poll_callback is not getting called and
hence the socket fd is not getting added to the ready list.

Thanx,
Tomar




  ___
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Nagendra Tomar

--- David Miller <[EMAIL PROTECTED]> wrote:

> From: Nagendra Tomar <[EMAIL PROTECTED]>
> Date: Wed, 19 Sep 2007 15:55:58 -0700 (PDT)
> 
> >  I agree that setting SOCK_NOSPACE would have been a more elegant
> > fix. Infact I thought a lot about that before deciding on this fix.
> 
> I guess this means you also noticed that you are removing
> the one and only test of this bit too?
> 
> You can't remove this, it's critical for performance.

I'm sure you would have seen value in the check that's why the 
check is there. 

Now we have two critical points to discuss

1. How can we achieve the ET EPOLLOUT event with the SOCK_NOSPACE
   check in place ?
2. How much effect will removing the check have (if we cannot 
   find a way to get the ET EPOLLOUT notification w/ the check
   in place) ? 

Regding (2), IMHO for a "fast sender" the SOCK_NOSPACE check will
almost always pass as the sender will come back to write (or poll)
before the prev data is drained out. If he doesn't do that, he is
not a "fast sender" by definition". A "fast sender" should always
have some data to send when he practically (per the sndbuf space) 
can.

For a "slow sender", do we really care abt the optimization ?

Thanx,
Tomar



  ___
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/ 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Nagendra Tomar

--- David Miller <[EMAIL PROTECTED]> wrote:

> From: Nagendra Tomar <[EMAIL PROTECTED]>
> Date: Wed, 19 Sep 2007 15:37:09 -0700 (PDT)
> 
> > With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call will 
> > not return, even when the incoming acks free the buffers.
> > Note that this patch assumes that the SOCK_NOSPACE check in
> > tcp_check_space is a trivial optimization which can be safely removed.
> 
> I already replied to your patch posting explaining that whatever is
> not setting SOCK_NOSPACE should be fixed instead.
> 
> Please address that, thanks.

Dave, 
 I agree that setting SOCK_NOSPACE would have been a more elegant
fix. Infact I thought a lot about that before deciding on this fix.
But the point here is that the SOCK_NOSPACE bit can be set when 
the sndbuf space is really less (less than sk_stream_min_wspace())
and some user action (sendmsg or poll) indicated his intent to write.
In the case mentioned none of these is true. Since user wants
to manage his tranmit buffers himself, his definition of less may
not match with what kernel feels is less. f.e. user might have 
dynamically changing mmap'ed buffer resources at his disposal 
which he wants to use as sendfile buffers. He wants to be notified
whenever a new incoming ack frees up one or more of his buffers,
so that he can reuse that buffer.
The bigger problem is that user is not indicating his intent
to write, to the kernel. He is just watching the sendbuf space and
when it matches his needs he will send new data. 

Thanx,
Tomar

 



  ___ 
Want ideas for reducing your carbon footprint? Visit Yahoo! For Good  
http://uk.promotions.yahoo.com/forgood/environment.html
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Nagendra Tomar
The tcp_check_space() function calls tcp_new_space() only if the
SOCK_NOSPACE bit is set in the socket flags. This is causing Edge Triggered
EPOLLOUT events to be missed for TCP sockets, as the ep_poll_callback() 
is not called from the wakeup routine.

The SOCK_NOSPACE bit indicates the user's intent to perform writes
on that socket (set in tcp_sendmsg and tcp_poll). I believe the idea 
behind the SOCK_NOSPACE check is to optimize away the tcp_new_space call
in cases when user is not interested in writing to the socket. These two
take care of all possible scenarios in which a user can convey his intent
to write on that socket.

Case 1: tcp_sendmsg detects lack of sndbuf space
Case 2: tcp_poll returns not writable

This is fine if we do not deal with epoll's Edge Triggered events (EPOLLET).
With ET events we can have a scenario where the SOCK_NOSPACE bit is not set,
as the user has neither done a sendmsg nor a poll/epoll call that returned
with the POLLOUT condition not set. 

In this case the user will _never_ get an ET POLLOUT event since 
tcp_check_space() will not call tcp_new_space() (as the SOCK_NOSPACE bit is 
not set), which does the real work. THIS IS AGAINST THE EPOLL ET PROMISE OF
DELIVERING AN EVENT WHENEVER THE EVENT ACTUALLY HAPPENS. 

This ET event will be very helpful to implement user level memory management
for mmap+sendfile zero copy Tx. So typically the application does this

void *alloc_sendfile_buf(void)
{
while(!next_free_buffer)
{
/*
 * No free buffers (all are dispatched to sendfile and are 
 * in use). Wait for one or more buffers to become free
 * The socket fd is registered with EPOLLET|EPOLLOUT events.
 * EPOLLET enables us to check for SIOCOUTQ only when some
 * more space becomes available.
 *
 * One would expect the ET EPOLLOUT event to be notified 
 * when TCP space is freed due to some ack coming in. 
 */
epoll_wait(...); /* wait for some incoming ack to free some
buffer from the retransmit queue */
ioctl(fd, SIOCOUTQ, _outq);
/*
 * see if we can mark some more "complete" buffers free
 * If it can mark one or more buffer free, it will set
 * next_free_buffer to point to the available buffer to use
 */
rehash_free_buffers(in_outq);
}
return next_free_buffer;
}

With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call will 
not return, even when the incoming acks free the buffers.
Note that this patch assumes that the SOCK_NOSPACE check in
tcp_check_space is a trivial optimization which can be safely removed.

Thanx,
Tomar


Signed-off-by: Nagendra Singh Tomar <[EMAIL PROTECTED]>
---

--- linux-2.6.23-rc6/net/ipv4/tcp_input.c.orig  2007-09-19 13:58:44.0 
+0530
+++ linux-2.6.23-rc6/net/ipv4/tcp_input.c   2007-09-19 10:17:36.0 
+0530
@@ -3929,8 +3929,7 @@ static void tcp_check_space(struct sock 
 {
if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
-   if (sk->sk_socket &&
-   test_bit(SOCK_NOSPACE, >sk_socket->flags))
+   if (sk->sk_socket)
tcp_new_space(sk);
}
 }


  ___
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/ 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Nagendra Tomar
The tcp_check_space() function calls tcp_new_space() only if the
SOCK_NOSPACE bit is set in the socket flags. This is causing Edge Triggered
EPOLLOUT events to be missed for TCP sockets, as the ep_poll_callback() 
is not called from the wakeup routine.

The SOCK_NOSPACE bit indicates the user's intent to perform writes
on that socket (set in tcp_sendmsg and tcp_poll). I believe the idea 
behind the SOCK_NOSPACE check is to optimize away the tcp_new_space call
in cases when user is not interested in writing to the socket. These two
take care of all possible scenarios in which a user can convey his intent
to write on that socket.

Case 1: tcp_sendmsg detects lack of sndbuf space
Case 2: tcp_poll returns not writable

This is fine if we do not deal with epoll's Edge Triggered events (EPOLLET).
With ET events we can have a scenario where the SOCK_NOSPACE bit is not set,
as the user has neither done a sendmsg nor a poll/epoll call that returned
with the POLLOUT condition not set. 

In this case the user will _never_ get an ET POLLOUT event since 
tcp_check_space() will not call tcp_new_space() (as the SOCK_NOSPACE bit is 
not set), which does the real work. THIS IS AGAINST THE EPOLL ET PROMISE OF
DELIVERING AN EVENT WHENEVER THE EVENT ACTUALLY HAPPENS. 

This ET event will be very helpful to implement user level memory management
for mmap+sendfile zero copy Tx. So typically the application does this

void *alloc_sendfile_buf(void)
{
while(!next_free_buffer)
{
/*
 * No free buffers (all are dispatched to sendfile and are 
 * in use). Wait for one or more buffers to become free
 * The socket fd is registered with EPOLLET|EPOLLOUT events.
 * EPOLLET enables us to check for SIOCOUTQ only when some
 * more space becomes available.
 *
 * One would expect the ET EPOLLOUT event to be notified 
 * when TCP space is freed due to some ack coming in. 
 */
epoll_wait(...); /* wait for some incoming ack to free some
buffer from the retransmit queue */
ioctl(fd, SIOCOUTQ, in_outq);
/*
 * see if we can mark some more complete buffers free
 * If it can mark one or more buffer free, it will set
 * next_free_buffer to point to the available buffer to use
 */
rehash_free_buffers(in_outq);
}
return next_free_buffer;
}

With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call will 
not return, even when the incoming acks free the buffers.
Note that this patch assumes that the SOCK_NOSPACE check in
tcp_check_space is a trivial optimization which can be safely removed.

Thanx,
Tomar


Signed-off-by: Nagendra Singh Tomar [EMAIL PROTECTED]
---

--- linux-2.6.23-rc6/net/ipv4/tcp_input.c.orig  2007-09-19 13:58:44.0 
+0530
+++ linux-2.6.23-rc6/net/ipv4/tcp_input.c   2007-09-19 10:17:36.0 
+0530
@@ -3929,8 +3929,7 @@ static void tcp_check_space(struct sock 
 {
if (sock_flag(sk, SOCK_QUEUE_SHRUNK)) {
sock_reset_flag(sk, SOCK_QUEUE_SHRUNK);
-   if (sk-sk_socket 
-   test_bit(SOCK_NOSPACE, sk-sk_socket-flags))
+   if (sk-sk_socket)
tcp_new_space(sk);
}
 }


  ___
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/ 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Nagendra Tomar

--- David Miller [EMAIL PROTECTED] wrote:

 From: Nagendra Tomar [EMAIL PROTECTED]
 Date: Wed, 19 Sep 2007 15:37:09 -0700 (PDT)
 
  With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call will 
  not return, even when the incoming acks free the buffers.
  Note that this patch assumes that the SOCK_NOSPACE check in
  tcp_check_space is a trivial optimization which can be safely removed.
 
 I already replied to your patch posting explaining that whatever is
 not setting SOCK_NOSPACE should be fixed instead.
 
 Please address that, thanks.

Dave, 
 I agree that setting SOCK_NOSPACE would have been a more elegant
fix. Infact I thought a lot about that before deciding on this fix.
But the point here is that the SOCK_NOSPACE bit can be set when 
the sndbuf space is really less (less than sk_stream_min_wspace())
and some user action (sendmsg or poll) indicated his intent to write.
In the case mentioned none of these is true. Since user wants
to manage his tranmit buffers himself, his definition of less may
not match with what kernel feels is less. f.e. user might have 
dynamically changing mmap'ed buffer resources at his disposal 
which he wants to use as sendfile buffers. He wants to be notified
whenever a new incoming ack frees up one or more of his buffers,
so that he can reuse that buffer.
The bigger problem is that user is not indicating his intent
to write, to the kernel. He is just watching the sendbuf space and
when it matches his needs he will send new data. 

Thanx,
Tomar

 



  ___ 
Want ideas for reducing your carbon footprint? Visit Yahoo! For Good  
http://uk.promotions.yahoo.com/forgood/environment.html
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Nagendra Tomar

--- David Miller [EMAIL PROTECTED] wrote:

 From: Nagendra Tomar [EMAIL PROTECTED]
 Date: Wed, 19 Sep 2007 15:55:58 -0700 (PDT)
 
   I agree that setting SOCK_NOSPACE would have been a more elegant
  fix. Infact I thought a lot about that before deciding on this fix.
 
 I guess this means you also noticed that you are removing
 the one and only test of this bit too?
 
 You can't remove this, it's critical for performance.

I'm sure you would have seen value in the check that's why the 
check is there. 

Now we have two critical points to discuss

1. How can we achieve the ET EPOLLOUT event with the SOCK_NOSPACE
   check in place ?
2. How much effect will removing the check have (if we cannot 
   find a way to get the ET EPOLLOUT notification w/ the check
   in place) ? 

Regding (2), IMHO for a fast sender the SOCK_NOSPACE check will
almost always pass as the sender will come back to write (or poll)
before the prev data is drained out. If he doesn't do that, he is
not a fast sender by definition. A fast sender should always
have some data to send when he practically (per the sndbuf space) 
can.

For a slow sender, do we really care abt the optimization ?

Thanx,
Tomar



  ___
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/ 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2.6.23-rc6 Resending] NETWORKING : Edge Triggered EPOLLOUT events get missed for TCP sockets

2007-09-19 Thread Nagendra Tomar

--- Davide Libenzi [EMAIL PROTECTED] wrote:

 On Wed, 19 Sep 2007, David Miller wrote:
 
  From: Nagendra Tomar [EMAIL PROTECTED]
  Date: Wed, 19 Sep 2007 15:37:09 -0700 (PDT)
  
   With the SOCK_NOSPACE check in tcp_check_space(), this epoll_wait call 
   will 
   not return, even when the incoming acks free the buffers.
   Note that this patch assumes that the SOCK_NOSPACE check in
   tcp_check_space is a trivial optimization which can be safely removed.
  
  I already replied to your patch posting explaining that whatever is
  not setting SOCK_NOSPACE should be fixed instead.
  
  Please address that, thanks.
 
 You're not planning of putting the notion of a SOCK_NOSPACE bit inside a 
 completely device-unaware interface like epoll, I hope?
 

Definitely not ! 

The point is that the tcp write space available 
wakeup does not get called if SOCK_NOSPACE bit is not set. This was
fine when the wakeup was merely a wakeup (since SOCK_NOSPACE bit 
indicated that someone really cared abt the wakeup). Now after the
introduction of callback'ed wakeups, we might have some work to
do inside the callback even if there is nobody interested in the wakeup
at that point of time. 

In this particular case the ep_poll_callback is not getting called and
hence the socket fd is not getting added to the ready list.

Thanx,
Tomar




  ___
Yahoo! Answers - Got a question? Someone out there knows the answer. Try it
now.
http://uk.answers.yahoo.com/
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/