[PATCH 2/2] ib fmr pool: flush used clean entries

2008-02-26 Thread Pete Wyckoff
Commit a3cd7d9070be417a21905c997ee32d756d999b38 (IB/fmr_pool:
ib_fmr_pool_flush() should flush all dirty FMRs) caused a
regression for iSER and was reverted in
e5507736c6449b3253347eed6f8ea77a28cf688e.

This change attempts to redo the original patch so that all used
FMR entries are flushed when ib_flush_fmr_pool() is called,
and other FMR users are not affected.  Simply move used entries
from the clean list onto the dirty list before letting the
cleanup thread do its job.

Signed-off-by: Pete Wyckoff <[EMAIL PROTECTED]>
---
 drivers/infiniband/core/fmr_pool.c |   17 ++++++++++++++++-
 1 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c
index 4044fdf..06d502c 100644
--- a/drivers/infiniband/core/fmr_pool.c
+++ b/drivers/infiniband/core/fmr_pool.c
@@ -398,8 +398,23 @@ EXPORT_SYMBOL(ib_destroy_fmr_pool);
  */
 int ib_flush_fmr_pool(struct ib_fmr_pool *pool)
 {
-   int serial = atomic_inc_return(&pool->req_ser);
+   int serial;
+   struct ib_pool_fmr *fmr, *next;
+
+   /*
+    * The free_list holds FMRs that may have been used
+    * but have not been remapped enough times to be dirty.
+    * Put them on the dirty list now so that the cleanup
+    * thread will reap them too.
+    */
+   spin_lock_irq(&pool->pool_lock);
+   list_for_each_entry_safe(fmr, next, &pool->free_list, list) {
+   if (fmr->remap_count > 0)
+   list_move(&fmr->list, &pool->dirty_list);
+   }
+   spin_unlock_irq(&pool->pool_lock);
 
+   serial = atomic_inc_return(&pool->req_ser);
wake_up_process(pool->thread);
 
if (wait_event_interruptible(pool->force_wait,
-- 
1.5.4.1



[PATCH 1/2] Revert "IB/fmr_pool: ib_fmr_pool_flush() should flush all dirty FMRs"

2008-02-26 Thread Pete Wyckoff
This reverts commit a3cd7d9070be417a21905c997ee32d756d999b38.

The original commit breaks iSER reliably, making it complain:

iser: iser_reg_page_vec:ib_fmr_pool_map_phys failed: -11

The FMR cleanup thread runs ib_fmr_batch_release() as dirty
entries build up.  This commit causes clean but used FMR
entries also to be purged.  During that process, another thread
can see that there are no free FMRs and fail, even though
there should always have been enough available.

Signed-off-by: Pete Wyckoff <[EMAIL PROTECTED]>
---
 drivers/infiniband/core/fmr_pool.c |   21 ++++++---------------
 1 files changed, 6 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c
index 7f00347..4044fdf 100644
--- a/drivers/infiniband/core/fmr_pool.c
+++ b/drivers/infiniband/core/fmr_pool.c
@@ -139,7 +139,7 @@ static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool,
 static void ib_fmr_batch_release(struct ib_fmr_pool *pool)
 {
int ret;
-   struct ib_pool_fmr *fmr, *next;
+   struct ib_pool_fmr *fmr;
LIST_HEAD(unmap_list);
LIST_HEAD(fmr_list);
 
@@ -158,20 +158,6 @@ static void ib_fmr_batch_release(struct ib_fmr_pool *pool)
 #endif
}
 
-   /*
-    * The free_list may hold FMRs that have been put there
-    * because they haven't reached the max_remap count.
-    * Invalidate their mapping as well.
-    */
-   list_for_each_entry_safe(fmr, next, &pool->free_list, list) {
-   if (fmr->remap_count == 0)
-   continue;
-   hlist_del_init(&fmr->cache_node);
-   fmr->remap_count = 0;
-   list_add_tail(&fmr->fmr->list, &fmr_list);
-   list_move(&fmr->list, &unmap_list);
-   }
-
list_splice(&pool->dirty_list, &unmap_list);
INIT_LIST_HEAD(&pool->dirty_list);
pool->dirty_len = 0;
@@ -384,6 +370,11 @@ void ib_destroy_fmr_pool(struct ib_fmr_pool *pool)
 
i = 0;
list_for_each_entry_safe(fmr, tmp, &pool->free_list, list) {
+   if (fmr->remap_count) {
+   INIT_LIST_HEAD(&fmr_list);
+   list_add_tail(&fmr->fmr->list, &fmr_list);
+   ib_unmap_fmr(&fmr_list);
+   }
ib_dealloc_fmr(fmr->fmr);
list_del(&fmr->list);
kfree(fmr);
-- 
1.5.4.1



[PATCH 0/2] (was Re: [ofa-general] fmr pool free_list empty)

2008-02-26 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Mon, 25 Feb 2008 15:02 -0800:
> Ugh.
[pw wrote:]
>  > Looking at the FMR dirty list unmapping code in
>  > ib_fmr_batch_release(), there is a section that pulls all the dirty
>  > entries onto a list that it will later unmap and put back on the
>  > free list.
> 
>  > But it also plans to unmap all the free entries that have ever been
>  > remapped:
> 
> Yes, this came from a3cd7d90 ("IB/fmr_pool: ib_fmr_pool_flush() should
> flush all dirty FMRs").  That solved a real problem for Olaf, because
> otherwise dirty FMRs that are not at the max map count might never get
> invalidated.  It's not exactly an optimization but rather a
> correctness issue, because RDS relies on killing mappings eventually.
> 
> On the other hand, this behavior clearly does lead to the possibility
> of leaving the free list temporarily empty for stupid reasons.
> 
> I don't see a really good way to fix this at the moment, need to
> meditate a little.

Adding CCs in case some iSER users are not on the openfabrics list.
Original message is here:
http://lists.openfabrics.org/pipermail/general/2008-February/047111.html

This quoted commit is a regression for iSER.  Not sure if it causes
problems for the other FMR user, SRP.  It went in after v2.6.24.
Following this mail are two patches: one to revert the change, and
one that attempts to redo Olaf's patch in such a way that it does
not cause problems for other FMR users.

I haven't tested the patches with RDS.  It apparently isn't in the
tree yet.  In fact, there are no users of ib_flush_fmr_pool() in the
tree, which is the only function affected by the second patch.  But
iSER is working again in my scenario.

As a side note, I don't remember seeing this patch on the
openfabrics mailing list.  Perhaps I missed it.  Sometimes these
sorts of interactions can be spotted if proposed changes get wider
attention.

-- Pete


Re: [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 20:09 -0800:
> One other area that has not been brought up yet (I think) is the
> applicability of notifiers in letting users know when pinned memory
> is reclaimed by the kernel.  This is useful when a lower-level
> library employs lazy deregistration strategies on memory regions that
> are subsequently released to the kernel via the application's use of
> munmap or sbrk.  Ohio Supercomputing Center has work in this area but
> a generalized approach in the kernel would certainly be welcome.

The whole need for memory registration is a giant pain.  There is no
motivating application need for it---it is simply a hack around
virtual memory and the lack of full VM support in current hardware.
There are real hardware issues that interact poorly with virtual
memory, as discussed previously in this thread.

The way a messaging cycle goes in IB is:

register buf
post send from buf
wait for completion
deregister buf

This tends to get hidden via userspace software libraries into
a single call:

MPI_send(buf)

Now if you actually do the reg/dereg every time, things are very
slow.  So userspace library writers came up with the idea of caching
registrations:

if buf is not registered:
register buf
post send from buf
wait for completion

The second time that the app happens to do a send from the same
buffer, it proceeds much faster.  Spatial locality applies here, and
this caching is generally worth it.  Some libraries have schemes to
limit the size of the registration cache too.
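
As a rough sketch of that caching idea (not taken from any particular
MPI implementation; the structure, the names, and the linear lookup
below are invented for illustration, and register_region() stands in
for ibv_reg_mr() plus bookkeeping):

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical userspace registration cache: register once, reuse on
 * later sends.  Real libraries use trees or hashes and handle
 * eviction; a linear list keeps the sketch short.
 */
struct reg_entry {
    void             *base;   /* start of the registered buffer */
    size_t            len;    /* length of the registration */
    uint32_t          lkey;   /* key returned by the verbs layer */
    struct reg_entry *next;
};

static struct reg_entry *reg_cache;

/* Assumed helper that pins and registers the memory with the HCA. */
extern struct reg_entry *register_region(void *buf, size_t len);

static struct reg_entry *lookup_or_register(void *buf, size_t len)
{
    struct reg_entry *e;

    for (e = reg_cache; e; e = e->next)
        if ((char *)buf >= (char *)e->base &&
            (char *)buf + len <= (char *)e->base + e->len)
            return e;                   /* hit: skip reg/dereg cost */

    e = register_region(buf, len);      /* miss: pin and register now */
    e->next = reg_cache;
    reg_cache = e;
    return e;
}

Nothing in that lookup can tell that buf was unmapped and remapped
since the entry was cached, which is exactly the failure mode
described below.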

But there are plenty of ways to hurt yourself with such a scheme.
The first is a huge pool of unused but registered memory, as the
library doesn't know the app patterns, and it doesn't know the VM
pressure level in the kernel.

There are plenty of subtle ways that this breaks too.  If the
registered buf is removed from the address space via munmap() or
sbrk() or other ways, the mapping and registration are gone, but the
library has no way of knowing that the app just did this.  Sure the
physical page is still there and pinned, but the app cannot get at
it.  Later if new address space arrives at the same virtual address
but a different physical page, the library will mistakenly think it
already has it registered properly, and data is transferred from
this old now-unmapped physical page.

The whole situation is rather ridiculous, but we are quite stuck
with it for current generation IB and iWarp hardware.  If we can't
have the kernel interact with the device directly, we could at least
manage state in these multiple userspace registration caches.  The
VM could ask for certain (or any) pages to be released, and the
library would respond if they are indeed not in use by the device.
The app itself does not know about pinned regions, and the library
is aware of exactly which regions are potentially in use.

Since the great majority of userspace messaging over IB goes through
middleware like MPI or PGAS languages, and they all have the same
approach to registration caching, this approach could fix the
problem for a big segment of use cases.

More text on the registration caching problem is here:

http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf

with an approach using vm_ops open and close operations in a kernel
module here:

http://www.osc.edu/~pw/dreg/

There is a place for VM notifiers in RDMA messaging, but not in
talking to devices, at least not the current set.  If you can define
a reasonable userspace interface for VM notifiers, libraries can
manage registration caches more efficiently, letting the kernel
unmap pinned pages as it likes.
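
Purely as a sketch of what such a userspace interface might look like
(it is hypothetical: the event layout, the notifier fd, and the helper
names below are all invented, and only the library side is shown):

#include <stddef.h>

/* One reclaim request from the kernel: a virtual address range it
 * would like the library to stop pinning. */
struct vm_notify_event {
    void   *addr;
    size_t  len;
};

/* Assumed helpers: one reads the next event from the hypothetical
 * notifier fd, the others come from the registration-cache library. */
extern int  vm_notify_next(int notify_fd, struct vm_notify_event *ev);
extern int  region_in_use_by_device(void *addr, size_t len);
extern void deregister_region(void *addr, size_t len);

/* Drop cached registrations the kernel asks about, unless the device
 * still has an operation outstanding on them. */
void drain_vm_notifications(int notify_fd)
{
    struct vm_notify_event ev;

    while (vm_notify_next(notify_fd, &ev) == 0)
        if (!region_in_use_by_device(ev.addr, ev.len))
            deregister_region(ev.addr, ev.len);
}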

-- Pete



Re: Integration of SCST in the mainstream Linux kernel

2008-02-02 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Wed, 30 Jan 2008 10:34 -0600:
> On Wed, 2008-01-30 at 09:38 +0100, Bart Van Assche wrote:
> > On Jan 30, 2008 12:32 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote:
> > >
> > > iSER has parameters to limit the maximum size of RDMA (it needs to
> > > repeat RDMA with a poor configuration)?
> > 
> > Please specify which parameters you are referring to. As you know I
> > had already repeated my tests with ridiculously high values for the
> > following iSER parameters: FirstBurstLength, MaxBurstLength and
> > MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block
> > size specified to dd).
> 
> the 1Mb block size is a bit of a red herring.  Unless you've
> specifically increased the max_sector_size and are using an sg_chain
> converted driver, on x86 the maximum possible transfer accumulation is
> 0.5MB.
> 
> I certainly don't rule out that increasing the transfer size up from
> 0.5MB might be the way to improve STGT efficiency, since at a 1GB/s
> theoretical peak, that's roughly 2000 context switches per I/O; however,
> it doesn't look like you've done anything that will overcome the block
> layer limitations.

The MRDSL parameter has no effect on iSER, as the RFC describes.
How to transfer data to satisfy a command is solely up to the
target.  So you would need big requests from the client, and then
to look at how the target will send the data.

I've only used 512 kB for the RDMA transfer size from the target, as
it matches the default client size and was enough to get good
performance out of my IB gear and minimizes resource consumption on
the target.  It's currently hard-coded as a #define.  There is no
provision in the protocol for the client to dictate the value.

If others want to spend some effort trying to tune stgt for iSER,
there are a fair number of comments in the code, including a big one
that explains this RDMA transfer size issue.  And I'll answer
informed questions as I can.  But I'm not particularly interested in
arguing about which implementation is best, or trying to interpret
bandwidth comparison numbers from poorly designed tests.  It takes
work to understand these issues.

-- Pete


Re: nosmp/maxcpus=0 or 1 -> TSC unstable

2008-01-16 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Wed, 16 Jan 2008 16:31 +0100:
> 
> * Pete Wyckoff <[EMAIL PROTECTED]> wrote:
> 
> > > pretty sure the culprit is that num_possible_cpus() > 1,
> > > which would mean cpu_possible_map contains the second cpu... but i'm 
> > > not quite sure what the right fix is... or perhaps this is all 
> > > intended.
> > 
> > We've seen the same problem.  We use gettimeofday() for timing of 
> > network-ish operations on the order of 10-50 us.  But not having the 
> > TSC makes gettimeofday() itself very slow, on the order of 30 us.
> 
> 30 usecs is too much - even with pmtimer it's typically below 5 usecs. 
> Could you run this on your box:
> 
>   http://people.redhat.com/mingo/time-warp-test/time-warp-test.c
> 
> and send back what it reports? (run it for a few minutes)

You're right.  That 30 us comes from an old comment.  Testing with
your code shows under 5 us as you expected.  I had to hack out the
#ifndef i386 error; compiling on x86-64.  Dual-socket 2.4 GHz
Opteron 250.  TSC-warps grows continually in all situations.

On 2.6.24-rc6 + scsi-misc + random stuff, 2 processors:

 | 0.39 us, TSC-warps:983002775 | 4.81 us, TOD-warps:0 | 4.81 us, CLOCK-warps:0 

With "maxcpus=1", no broken patch to force TSC:

 | 0.33 us, TSC-warps:679679972 | 3.30 us, TOD-warps:0 | 3.30 us, CLOCK-warps:0 

With "maxcpus=1", including my broken patch to force use of TSC in
this situation:

 | 0.05 us, TSC-warps:2884019968 | 0.45 us, TOD-warps:0 | 0.45 us, CLOCK-warps:0

For giggles, an older fedora kernel (2.6.23.1-42.fc8) gives:

 | 0.87 us, TSC-warps:575054334 | 8.67 us, TOD-warps:0 | 8.67 us, CLOCK-warps:0 

-- Pete


Re: nosmp/maxcpus=0 or 1 -> TSC unstable

2008-01-15 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Sat, 12 Jan 2008 11:48 -0800:
> if i boot an x86 64-bit 2.6.24-rc7 kernel with nosmp, maxcpus=0 or 1 it 
> still disables TSC :)
> 
> Marking TSC unstable due to TSCs unsynchronized
> 
> this is an opteron 2xx box which does have two cpus and no clock-divide in 
> halt or cpufreq enabled so TSC should be fine with only one cpu.
> 
> pretty sure the culprit is that num_possible_cpus() > 1, which
> would mean cpu_possible_map contains the second cpu... but i'm not quite 
> sure what the right fix is... or perhaps this is all intended.

We've seen the same problem.  We use gettimeofday() for timing of
network-ish operations on the order of 10-50 us.  But not having
the TSC makes gettimeofday() itself very slow, on the order of 30 us.

Here's what we've been using for quite a few kernel versions.  I've
not tried to submit it for fear that it could break some other
scenario, as you suggest.  Although in hotplug scenarios, this
function unsynchronized_tsc() should get rerun and disable TSC if
more processors arrive.

At least count this as a "me too".

-- Pete


From 0cdcd494bc0e27f49438bc2fc72fd3823629802b Mon Sep 17 00:00:00 2001
From: Pete Wyckoff <[EMAIL PROTECTED]>
Date: Tue, 15 Jan 2008 17:42:28 -0500
Subject: [PATCH] use tsc on 1 cpu smp

Use num_online_cpus() instead of num_present_cpus() as the
parameter to check when deciding if TSC is good enough.  Thus
explicitly booting with maxcpus=1 will let us use the TSC even on
a dual-processor machine.  This helps reduce gettimeofday
overheads on our dual Opteron nodes immensely (30 us vs 0.5 us).

Signed-off-by: Pete Wyckoff <[EMAIL PROTECTED]>
---
 arch/x86/kernel/tsc_64.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/tsc_64.c b/arch/x86/kernel/tsc_64.c
index 9c70af4..5f2e91f 100644
--- a/arch/x86/kernel/tsc_64.c
+++ b/arch/x86/kernel/tsc_64.c
@@ -235,7 +235,7 @@ __cpuinit int unsynchronized_tsc(void)
}
 
/* Assume multi socket systems are not synchronized */
-   return num_present_cpus() > 1;
+   return num_online_cpus() > 1;
 }
 
 int __init notsc_setup(char *s)
-- 
1.5.3.7



Re: [PATCH] bsg : Add support for io vectors in bsg

2008-01-14 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Fri, 11 Jan 2008 19:16 -0500:
> James Bottomley wrote:
>> On Thu, 2008-01-10 at 16:46 -0500, Pete Wyckoff wrote:
>>> [EMAIL PROTECTED] wrote on Thu, 10 Jan 2008 14:55 -0600:
>>>> On Thu, 2008-01-10 at 15:43 -0500, Pete Wyckoff wrote:
>>>>> [EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900:
>>>>>> On Tue, 8 Jan 2008 17:09:18 -0500
>>>>>> Pete Wyckoff <[EMAIL PROTECTED]> wrote:
>>>>>>> I took another look at the compat approach, to see if it is feasible
>>>>>>> to keep the compat handling somewhere else, without the use of #ifdef
>>>>>>> CONFIG_COMPAT and size-comparison code inside bsg.c.  I don't see how.
>>>>>>> The use of iovec is within a write operation on a char device.  It's
>>>>>>> not amenable to a compat_sys_ or a .compat_ioctl approach.
>>>>>>>
>>>>>>> I'm partial to #1 because the use of architecture-independent fields
>>>>>>> matches the rest of struct sg_io_v4.  But if you don't want to have
>>>>>>> another iovec type in the kernel, could we do #2 but just return
>>>>>>> -EINVAL if the need for compat is detected?  I.e. change
>>>>>>> dout_iovec_count to dout_iovec_length and do the math?
>>>>>> If you are ok with removing the write/read interface and just have
>>>>>> ioctl, we could handle compat stuff like others do. But I think
>>>>>> that you (OSD people) really want to keep the write/read
>>>>>> interface. Sorry, I think that there is no workaround to support iovec
>>>>>> in bsg.
>>>>> I don't care about read/write in particular.  But we do need some
>>>>> way to launch asynchronous SCSI commands, and currently read/write
>>>>> are the only way to do that in bsg.  The reason is to keep multiple
>>>>> spindles busy at the same time.
>>>> Won't multi-threading the ioctl calls achieve the same effect?  Or do
>>>> you trip over BKL there?
>>> There's no BKL on (new) ioctls anymore, at least.  A thread per
>>> device would be feasible perhaps.  But if you want any sort of
>>> pipelining out of the device, esp. in the remote iSCSI case, you
>>> need to have a good number of commands outstanding to each device.
>>> So a thread per command per device.  Typical iSCSI queue depth of
>>> 128 times 16 devices for a small setup is a lot of threads.
>>
>> I was actually thinking of a thread per outstanding command.
>>
>>> The pthread/pipe latency overhead is not insignificant for fast
>>> storage networks too.
>>>
>>>>> How about these new ioctls instead of read/write:
>>>>>
>>>>> SG_IO_SUBMIT - start a new blk_execute_rq_nowait()
>>>>> SG_IO_TEST   - complete and return a previous req
>>>>> SG_IO_WAIT   - wait for a req to finish, interruptibly
>>>>>
>>>>> Then old write users will instead do ioctl SUBMIT.  Read users will
>>>>> do TEST for non-blocking fd, or WAIT for blocking.  And SG_IO could
>>>>> be implemented as SUBMIT + WAIT.
>>>>>
>>>>> Then we can do compat_ioctl and convert up iovecs out-of-line before
>>>>> calling the normal functions.
>>>>>
>>>>> Let me know if you want a patch for this.
>>>> Really, the thought of re-inventing yet another async I/O interface
>>>> isn't very appealing.
>>> I'm fine with read/write, except Tomo is against handling iovecs
>>> because of the compat complexity with struct iovec being different
>>> on 32- vs 64-bit.  There is a standard way to do "compat" ioctl that
>>> hides this handling in a different file (not bsg.c), which is the
>>> only reason I'm even considering these ioctls.  I don't care about
>>> compat setups per se.
>>>
>>> Is there another async I/O mechanism?  Userspace builds the CDBs,
>>> just needs some way to drop them in SCSI ML.  BSG is almost perfect
>>> for this, but doesn't do iovec, leading to lots of memcpy.
>>
>> No, it's just that async interfaces in Linux have a long and fairly
>> unhappy history.
>
> The sg driver's async interface has been pretty stable for
> a long time. The sync SG_IO ioctl is built on top of the
> async interface. That makes the async interface extremely
> well tested.
>
> The write()/read() async interface in sg does have one
> problem: when a command is dispatched via a write()
> it would be very useful to get back a tag but that
> violates write()'s second argument: 'const void * buf'.
> That tag could be useful both for identification of the
> response and by task management functions.
>
> I was hoping that the 'flags' field in sgv4 could be used
> to implement the variants:
> SG_IO_SUBMIT - start a new blk_execute_rq_nowait()
> SG_IO_TEST   - complete and return a previous req
> SG_IO_WAIT   - wait for a req to finish, interruptibly
>
> that way the existing SG_IO ioctl is sufficient.
>
> And if Tomo doesn't want to do it in the bsg driver,
> then it could be done in the sg driver.

The sg driver already has async via read/write, and it works fine.
Perhaps someone wants the ioctl versions too, but that's not my main
goal here.

I think that sg doesn't bother with compat iovec handling.  It just
uses SZ_SG_IOVEC (defined as sizeof(sg_iovec_t)) and doesn't check
if the userspace iovec happens to be smaller.  Sg does have a compat
ioctl function; it just doesn't support SG_IO.  So, on 64-bit
kernel, read/write from 32-bit userspace with iovec will get
undefined results due to the mis-interpretation of the iovec fields,
while ioctl from 32-bit will fail with ENOIOCTLCMD.


Re: [PATCH] bsg : Add support for io vectors in bsg

2008-01-10 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Thu, 10 Jan 2008 14:55 -0600:
> On Thu, 2008-01-10 at 15:43 -0500, Pete Wyckoff wrote:
> > [EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900:
> > > On Tue, 8 Jan 2008 17:09:18 -0500
> > > Pete Wyckoff <[EMAIL PROTECTED]> wrote:
> > > > I took another look at the compat approach, to see if it is feasible
> > > > to keep the compat handling somewhere else, without the use of #ifdef
> > > > CONFIG_COMPAT and size-comparison code inside bsg.c.  I don't see how.
> > > > The use of iovec is within a write operation on a char device.  It's
> > > > not amenable to a compat_sys_ or a .compat_ioctl approach.
> > > > 
> > > > I'm partial to #1 because the use of architecture-independent fields
> > > > matches the rest of struct sg_io_v4.  But if you don't want to have
> > > > another iovec type in the kernel, could we do #2 but just return
> > > > -EINVAL if the need for compat is detected?  I.e. change
> > > > dout_iovec_count to dout_iovec_length and do the math?
> > > 
> > > If you are ok with removing the write/read interface and just have
> > > ioctl, we could handle compat stuff like others do. But I think
> > > that you (OSD people) really want to keep the write/read
> > > interface. Sorry, I think that there is no workaround to support iovec
> > > in bsg.
> > 
> > I don't care about read/write in particular.  But we do need some
> > way to launch asynchronous SCSI commands, and currently read/write
> > are the only way to do that in bsg.  The reason is to keep multiple
> > spindles busy at the same time.
> 
> Won't multi-threading the ioctl calls achieve the same effect?  Or do
> you trip over BKL there?

There's no BKL on (new) ioctls anymore, at least.  A thread per
device would be feasible perhaps.  But if you want any sort of
pipelining out of the device, esp. in the remote iSCSI case, you
need to have a good number of commands outstanding to each device.
So a thread per command per device.  Typical iSCSI queue depth of
128 times 16 devices for a small setup is a lot of threads.

The pthread/pipe latency overhead is not insignificant for fast
storage networks too.

> > How about these new ioctls instead of read/write:
> > 
> > SG_IO_SUBMIT - start a new blk_execute_rq_nowait()
> > SG_IO_TEST   - complete and return a previous req
> > SG_IO_WAIT   - wait for a req to finish, interruptibly
> > 
> > Then old write users will instead do ioctl SUBMIT.  Read users will
> > do TEST for non-blocking fd, or WAIT for blocking.  And SG_IO could
> > be implemented as SUBMIT + WAIT.
> > 
> > Then we can do compat_ioctl and convert up iovecs out-of-line before
> > calling the normal functions.
> > 
> > Let me know if you want a patch for this.
> 
> Really, the thought of re-inventing yet another async I/O interface
> isn't very appealing.

I'm fine with read/write, except Tomo is against handling iovecs
because of the compat complexity with struct iovec being different
on 32- vs 64-bit.  There is a standard way to do "compat" ioctl that
hides this handling in a different file (not bsg.c), which is the
only reason I'm even considering these ioctls.  I don't care about
compat setups per se.

Is there another async I/O mechanism?  Userspace builds the CDBs,
just needs some way to drop them in SCSI ML.  BSG is almost perfect
for this, but doesn't do iovec, leading to lots of memcpy.

-- Pete


Re: [PATCH] bsg : Add support for io vectors in bsg

2008-01-10 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900:
> On Tue, 8 Jan 2008 17:09:18 -0500
> Pete Wyckoff <[EMAIL PROTECTED]> wrote:
> > I took another look at the compat approach, to see if it is feasible
> > to keep the compat handling somewhere else, without the use of #ifdef
> > CONFIG_COMPAT and size-comparison code inside bsg.c.  I don't see how.
> > The use of iovec is within a write operation on a char device.  It's
> > not amenable to a compat_sys_ or a .compat_ioctl approach.
> > 
> > I'm partial to #1 because the use of architecture-independent fields
> > matches the rest of struct sg_io_v4.  But if you don't want to have
> > another iovec type in the kernel, could we do #2 but just return
> > -EINVAL if the need for compat is detected?  I.e. change
> > dout_iovec_count to dout_iovec_length and do the math?
> 
> If you are ok with removing the write/read interface and just have
> ioctl, we could handle compat stuff like others do. But I think
> that you (OSD people) really want to keep the write/read
> interface. Sorry, I think that there is no workaround to support iovec
> in bsg.

I don't care about read/write in particular.  But we do need some
way to launch asynchronous SCSI commands, and currently read/write
are the only way to do that in bsg.  The reason is to keep multiple
spindles busy at the same time.

How about these new ioctls instead of read/write:

SG_IO_SUBMIT - start a new blk_execute_rq_nowait()
SG_IO_TEST   - complete and return a previous req
SG_IO_WAIT   - wait for a req to finish, interruptibly

Then old write users will instead do ioctl SUBMIT.  Read users will
do TEST for non-blocking fd, or WAIT for blocking.  And SG_IO could
be implemented as SUBMIT + WAIT.
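
To make the intent concrete, userspace might drive the proposed
ioctls roughly as sketched below.  SG_IO_SUBMIT and SG_IO_WAIT do not
exist in any kernel; the numbers are placeholders so the sketch is
self-contained, and reusing the same sg_io_v4 struct as the handle
for SG_IO_WAIT is an assumption about how such a patch would work:

#include <linux/bsg.h>          /* struct sg_io_v4 */
#include <sys/ioctl.h>

/* Placeholder numbers for the proposed (nonexistent) ioctls. */
#ifndef SG_IO_SUBMIT
#define SG_IO_SUBMIT 0x2290
#define SG_IO_WAIT   0x2292
#endif

/* Queue a batch of commands without blocking, then reap them all,
 * keeping many requests outstanding on the device at once. */
int submit_and_reap(int bsg_fd, struct sg_io_v4 *cmds, int ncmds)
{
    int i;

    for (i = 0; i < ncmds; i++)
        if (ioctl(bsg_fd, SG_IO_SUBMIT, &cmds[i]) < 0)
            return -1;

    for (i = 0; i < ncmds; i++)
        if (ioctl(bsg_fd, SG_IO_WAIT, &cmds[i]) < 0)
            return -1;

    return 0;
}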

Then we can do compat_ioctl and convert up iovecs out-of-line before
calling the normal functions.

Let me know if you want a patch for this.

-- Pete



Re: [PATCH] bsg : Add support for io vectors in bsg

2008-01-08 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Sat, 05 Jan 2008 14:01 +0900:
> From: Deepak Colluru <[EMAIL PROTECTED]>
> Subject: [PATCH] bsg : Add support for io vectors in bsg
> Date: Fri, 4 Jan 2008 21:47:34 +0530 (IST)
> 
> > From: Deepak Colluru <[EMAIL PROTECTED]>
> > 
> > Add support for io vectors in bsg.
> > 
> > Signed-off-by: Deepak Colluru <[EMAIL PROTECTED]>
> > ---
> >   bsg.c |   52 +---
> >   1 file changed, 49 insertions(+), 3 deletions(-)
> 
> Thanks, but I have to NACK this.
> 
> You can find the discussion about bsg io vector support and a similar
> patch in linux-scsi archive. I have no plan to support it since it
> needs the compat hack.

You may recall this is one of the patches I need to use bsg with OSD
devices.  OSDs overload the SCSI buffer model to put multiple fields
in dataout and datain.  Some is user data, but some is more
logically created by a library.  Memcpying in userspace to wedge all
the segments into a single buffer is painful, and is required both
on outgoing and incoming data buffers.

There are two approaches to add iovec to bsg.

1.  Define a new sg_iovec_v4 that uses constant width types.  Both
32- and 64-bit userspace would hand arrays of this to the kernel.

struct sg_v4_iovec {
__u64 iov_base;
__u32 iov_len;
__u32 __pad1;
};

Old patch here:  http://article.gmane.org/gmane.linux.scsi/30461/


2.  Do as Deepak has done, using the existing sg_iovec, but then
also work around the compat issue.  Old v3 sg_iovec is:

typedef struct sg_iovec /* same structure as used by readv() Linux system */
{   /* call. It defines one scatter-gather element. */
void __user *iov_base;  /* Starting address  */
size_t iov_len; /* Length in bytes  */
} sg_iovec_t;

Old patch here:  http://article.gmane.org/gmane.linux.scsi/30460/

I took another look at the compat approach, to see if it is feasible
to keep the compat handling somewhere else, without the use of #ifdef
CONFIG_COMPAT and size-comparison code inside bsg.c.  I don't see how.
The use of iovec is within a write operation on a char device.  It's
not amenable to a compat_sys_ or a .compat_ioctl approach.
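
For reference, the sort of shim it would take if #2 went in anyway.
This is only a sketch of the pattern, not code from any tree, and
bsg_copy_compat_iovec is a made-up name:

#include <linux/compat.h>
#include <scsi/sg.h>
#include <asm/uaccess.h>

#ifdef CONFIG_COMPAT
/* Convert a user array of 32-bit compat_iovec into native sg_iovec. */
static int bsg_copy_compat_iovec(struct sg_iovec *dst,
                                 const struct compat_iovec __user *src,
                                 int count)
{
        int i;

        for (i = 0; i < count; i++) {
                compat_uptr_t base;
                compat_size_t len;

                if (get_user(base, &src[i].iov_base) ||
                    get_user(len, &src[i].iov_len))
                        return -EFAULT;
                dst[i].iov_base = compat_ptr(base);
                dst[i].iov_len = len;
        }
        return 0;
}
#endif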

I'm partial to #1 because the use of architecture-independent fields
matches the rest of struct sg_io_v4.  But if you don't want to have
another iovec type in the kernel, could we do #2 but just return
-EINVAL if the need for compat is detected?  I.e. change
dout_iovec_count to dout_iovec_length and do the math?

-- Pete
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ofa-general] Re: list corruption on ib_srp load in v2.6.24-rc5

2007-12-22 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Fri, 21 Dec 2007 17:18 -0500:
> On Fri, 2007-12-21 at 16:52 -0500, David Dillow wrote:
> > I'm getting the following oops when doing the following commands:
> > 
> > modprobe ib_srp
> > add targets(s) to ib_srp using sysfs
> > rmmod ib_srp
> > modprobe ib_srp
> > OOPS
> > 
> > I'm going to try and track down how the list is getting corrupted; it
> > looks like attribute_container_list in
> > drivers/base/attribute_container.c is the one getting corrupted.
> 
> Ok, found the culprit, now to figure out the motive and fix it.
> 
> ib_srp's srp_cleanup_module calls srp_release_transport(), which calls
> transport_container_unregister() for the rport_attr_cont member of
> struct srp_internal.
> 
> That last unregister call is returning -EBUSY, but it gets ignored, and
> the list node gets erased (or just reused) when the module's text/memory
> is free'd.
> 
> Now, to see if ib_srp should be waiting for everything to be destroyed
> before calling srp_release_transport(), or if it is just not removing
> some attributes properly.

I don't see where srp_cleanup_module() is calling srp_remove_host().
That is the likely way that transport devices should be made to go
away.  Something on the order of srp_remove_work().

Or srp_remove_one(), except with a call to srp_remove_host(), may be
necessary.  In fact, maybe just adding that call will fix it, as
ib_unregister_client() should drive the remove function.  Guesses, all
this.

-- Pete
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: getrusage vs /proc/pid/stat?

2001-06-18 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
> I'd like to monitor CPU, memory, and I/O utilization in a 
> long-running multithreaded daemon under kernels 2.2, 2.4,
> and possibly also Solaris (#ifdefs are ok).
> 
> getrusage() looked promising, and might even work for CPU utilization.
> Dunno if it returns info for all child threads yet, haven't tried it.
> In Linux, though, getrusage() doesn't return any info about RAM.
> 
> I know I can get the RSS and VSIZE under Linux by parsing /proc/pid/stat,
> but was hoping for a faster interface (although I suppose a seek,
> a read, and an ascii parse isn't *that* slow).  Is /proc/pid/stat
> the only way to go under Linux to monitor RSS?

getrusage() isn't really the system call you want for this.

There is a max RSS field (ru_maxrss), which Linux could support but
doesn't fill in, and in any case you seem to want the current RSS over
time.  Search the archive for various patches/complaints about
getrusage.

For vsize, most OSes offer time-integral averages of text, data, and
stack sizes via getrusage().  Again, more of an aggregate than a current
snapshot, and again, Linux currently returns zero for these.
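
For what it's worth, a minimal sketch of the /proc parse (vsize is
field 23 and rss field 24 of /proc/pid/stat per proc(5); rss comes
back in pages):

#include <stdio.h>
#include <string.h>

static int read_vm_stats(unsigned long *vsize, long *rss)
{
        char buf[512], *p;
        FILE *f = fopen("/proc/self/stat", "r");

        if (!f)
                return -1;
        if (!fgets(buf, sizeof(buf), f)) {
                fclose(f);
                return -1;
        }
        fclose(f);

        p = strrchr(buf, ')');          /* skip pid and "(comm)" safely */
        if (!p)
                return -1;
        /* skip the state character plus 19 more fields, then vsize, rss */
        if (sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u "
                          "%*u %*u %*d %*d %*d %*d %*d %*d %*u %lu %ld",
                   vsize, rss) != 2)
                return -1;
        return 0;
}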

-- Pete
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [docPATCH] mm.h documentation

2001-06-18 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
>  typedef struct page {
[..]
> + unsigned long index;/* Our offset within mapping. */

[..]
> + * A page may belong to an inode's memory mapping. In this case,
> + * page->mapping is the pointer to the inode, and page->offset is the
> + * file offset of the page (not necessarily a multiple of PAGE_SIZE).

Minor nit.

The field offset was renamed to index some time ago, but I can't
figure out if the usage changed.  Can you fix the comment and educate
us?

-- Pete
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: 3com Driver and the 3XP Processor

2001-06-15 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
> [EMAIL PROTECTED] writes:
>  > So are there any intresting changes one can make to the acenic?
> 
> Like I said, there is an entire firmware developer kit, so the only
> limit is your imagination and coding skills :-)

I wrote a new firmware from scratch to offload most of a message passing
implementation (MPI, actually) into the Alteon NIC, including timeout
and retransmit, message matching, header building, etc.  It's almost
ready for release.  We're currently working on using both processors
of the Tigon in parallel.

There is other work which has modified the Alteon firmware to do, for
example, segmentation and reassembly (Underwood et al.), and further
work in progress on other topics.

Programmable NICs are fun.  I'd like to get one of those Tigon3 cards
to see if any of the register layout on the Tigon2 is still there,
hoping the interface is similar at least.

-- Pete
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: forcibly unmap pages in driver?

2001-06-05 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
> My driver uses a variable-size DMA buffer that it shares with user-space; I
> provide an ioctl() to choose the buffer size and allocate the buffer. Say
> the user program chooses a large buffer size, and mmap()s the entire buffer.
> Later, the program calls the ioctl() again to set a smaller buffer size, or
> closes the file descriptor. At this point I'd like to shrink the buffer or
> free it completely. But I can't assume that the program will be nice and
> munmap() the region for me - it might still have the large buffer mapped.
> What should I do here?
> 
> An easy solution would to allocate the largest possible buffer as my driver
> is loaded, even if not all of it will be exposed to user-space. I don't
> really like this choice because the buffer needs to be pinned in memory, and
> the largest useful buffer size is very big (several tens of MB). Maybe I
> should disallow more than one buffer allocation per open() of the device...
> But the memory mapping will stay around even after close(), correct? I'd
> hate to have to keep the buffer around until my driver module is unloaded.

I see.  Your explanation makes sense, and close won't affect the mmap
(unless you explicitly trigger it in the driver, which you should).
Other drivers deal with this; you may want to go grepping for do_munmap
and see how they handle it.

[ > [EMAIL PROTECTED] said: ]
> > However, do_munmap() will call zap_page_range() for you and take care of
> > cache and TLB flushing if you're going to do this in the kernel.
> 
> I'm not sure if I could use do_munmap() -- how will I know if the user
> program has called munmap() already, and then mmap()ed something else in the
> same place? Then I'd be killing the wrong mapping...

Look at drivers/char/drm, for example.  At mmap time it attaches a
vm_operations_struct to the vma.  With that you catch changes to the
vma initiated by a user mmap, munmap, etc.  You could also map the
pages in dynamically using the (optional) nopage method.
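
As a rough 2.4-era sketch of that approach (struct mybuf, its
vma_count, and mybuf_free_pages() are made up here just to show the
shape of it):

#include <linux/mm.h>
#include <linux/fs.h>
#include <asm/atomic.h>
#include <asm/io.h>

struct mybuf {
        void *kva;              /* hypothetical pinned buffer */
        atomic_t vma_count;     /* live user mappings of it */
};

static void mybuf_free_pages(struct mybuf *buf);        /* hypothetical */

static void mybuf_vma_open(struct vm_area_struct *vma)
{
        struct mybuf *buf = vma->vm_private_data;

        atomic_inc(&buf->vma_count);
}

static void mybuf_vma_close(struct vm_area_struct *vma)
{
        struct mybuf *buf = vma->vm_private_data;

        /* Runs on munmap() and on exit; free only when the last map goes. */
        if (atomic_dec_and_test(&buf->vma_count))
                mybuf_free_pages(buf);
}

static struct vm_operations_struct mybuf_vm_ops = {
        open:  mybuf_vma_open,
        close: mybuf_vma_close,
};

static int mybuf_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct mybuf *buf = file->private_data;

        if (remap_page_range(vma->vm_start, virt_to_phys(buf->kva),
                             vma->vm_end - vma->vm_start, vma->vm_page_prot))
                return -EAGAIN;
        vma->vm_ops = &mybuf_vm_ops;
        vma->vm_private_data = buf;
        mybuf_vma_open(vma);    /* count the mapping created by this mmap() */
        return 0;
}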

The less elegant way of doing this is to put in your userspace API some
conditions which say:  if you do the following:

open(fd);
ioctl(fd, give_me_the_buf);
munmap(some_or_all_of_the_buf);
buf2 = mmap(...);
close(fd);  /* or ioctl(fd, shrink_the_buf); */
buf2[27] = 3;

you will be killed with a sigbus.  I.e. don't manually munmap the
region.

-- Pete
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: forcibly unmap pages in driver?

2001-06-05 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
> I am writing a device driver that, like many others, exposes a shared memory
> region to user-space via mmap(). The region is allocated with vmalloc(), the
> pages are marked reserved, and the user-space mapping is implemented with
> remap_page_range().
> 
> In my driver, I may have to free the underlying vmalloc() region while the
> user-space program is still running. I need to remove the user-space
> mapping -- otherwise the user process would still have access to the
> now-freed pages. I need an inverse of remap_page_range().
> 
> Is zap_page_range() the function I am looking for? Unfortunately it's not
> exported to modules =(. As a quick fix, I was thinking I could just remap
> all of the user pages to point to a zeroed page or something...
> 
> Another question- in the mm.c sources, I see that many of the memory-mapping
> functions are surrounded by calls to flush_cache_range() and
> flush_tlb_range(). But I don't see these calls in many drivers. Is it
> necessary to make them when my driver maps or unmaps the shared memory
> region?

That seems a bit perverse.  How will the poor userspace program know
not to access the pages you have yanked away from it?  If you plan
to kill it, better to do that directly.  If you plan to signal it
that the mapping is gone, it can just call munmap() itself.

However, do_munmap() will call zap_page_range() for you and take care of
cache and TLB flushing if you're going to do this in the kernel.

Your driver mmap function is called by do_mmap_pgoff(), which takes
care of those issues, and there is no (*munmap) in file_operations;
perhaps you are the first driver writer to want to unmap in the kernel.

-- Pete
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Makefile patch for cscope and saner Ctags

2001-05-31 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
> The following patch generates saner Ctags and will build cscope
> output.  It's against 2.4.5
> 
> --- Makefile.old  Mon May 28 22:44:01 2001
> +++ Makefile  Wed May 30 17:50:01 2001
> @@ -334,11 +334,32 @@
>  
>  # Exuberant ctags works better with -I
>  tags: dummy
> - CTAGSF=`ctags --version | grep -i exuberant >/dev/null && echo "-I 
>__initdata,__exitdata,EXPORT_SYMBOL,EXPORT_SYMBOL_NOVERS"`; \
> + CTAGSF=`ctags --version | grep -i exuberant >/dev/null && echo "--sort=no -I 
>__initdata,__exitdata,EXPORT_SYMBOL,EXPORT_SYMBOL_NOVERS"`; \
>   ctags $$CTAGSF `find include/asm-$(ARCH) -name '*.h'` && \
> - find include -type d \( -name "asm-*" -o -name config \) -prune -o -name '*.h' 
>-print | xargs ctags $$CTAGSF -a && \
> + find include -type f -name '*.h' -mindepth 2 -maxdepth 2 \
> + | grep -v include/asm- | grep -v include/config \
> + | xargs -r ctags $$CTAGSF -a && \
> + find include -type f -name '*.h' -mindepth 3 -maxdepth 3 \
> + | grep -v include/asm- | grep -v include/config \
> + | xargs -r ctags $$CTAGSF -a && \
> + find include -type f -name '*.h' -mindepth 4 -maxdepth 4 \
> + | grep -v include/asm- | grep -v include/config \
> + | xargs -r ctags $$CTAGSF -a && \
> + find include -type f -name '*.h' -mindepth 5 -maxdepth 5 \
> + | grep -v include/asm- | grep -v include/config \
> + | xargs -r ctags $$CTAGSF -a && \
>   find $(SUBDIRS) init -name '*.c' | xargs ctags $$CTAGSF -a
> + mv tags tags.unsorted
> + LC_ALL=C sort -k 1,1 -s tags.unsorted > tags
> + rm tags.unsorted
>  
> +cscope: dummy
> + find include/asm-$(ARCH) -name '*.h' >cscope.files
> + find include $(SUBDIRS) init -type f -name '*.[ch]' \
> + | grep -v include/asm- | grep -v include/config >> cscope.files
> + cscope -b -I include
> +
> + 
>  ifdef CONFIG_MODULES
>  ifdef CONFIG_MODVERSIONS
>  MODFLAGS += -DMODVERSIONS -include $(HPATH)/linux/modversions.h

You seem not to have read my response to your earlier mail proposing
such a thing (for tags only, not cscope):

http://boudicca.tux.org/hypermail/linux-kernel/2001week21/1869.html

How does the patch above fix anything?  You're sorting so that
include/linux/*.h comes before include/linux/{mtd,lockd,raid,...}/*.h,
but I don't see how that can be an improvement, or how it addresses
your original complaint "ctags doesn't honour any CPP #if'ing".

-- Pete
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: pte_page

2001-05-30 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
> On Wed, 30 May 2001 [EMAIL PROTECTED] wrote:
> > I use the 'pgt_offset', 'pmd_offset', 'pte_offset' and 'pte_page'
> > inside a module to get the physical address of a user space virtual
> > address. The physical address returned by 'pte_page' is not page
> > aligned whereas the virtual address was page aligned. Can somebody
> > tell me the reason?
> 
> __pa(page_address(pte_page(pte))) is the address you want. [or
> pte_val(*pte) & (PAGE_SIZE-1) on x86 but this is platform-dependent.]

Does this work on x86 non-kmapped highmem user pages too?  (i.e. is
page->virtual valid for every potential user page.)
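
(For context, the lookup under discussion looks roughly like this on
2.4, with the pgd_none/pmd_none/pte_present checks omitted;
page_address() is exactly the step in question for non-kmapped highmem
pages.)

#include <linux/mm.h>
#include <asm/pgtable.h>
#include <asm/page.h>

static unsigned long uvaddr_to_phys(struct mm_struct *mm,
                                    unsigned long vaddr)
{
        pgd_t *pgd = pgd_offset(mm, vaddr);
        pmd_t *pmd = pmd_offset(pgd, vaddr);
        pte_t *pte = pte_offset(pmd, vaddr);

        return __pa(page_address(pte_page(*pte))) + (vaddr & ~PAGE_MASK);
}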

-- Pete

> > Also, can i use these functions to get the physical address of a
> > kernel virtual address using init_mm?
> 
> nope. Eg. on x86 these functions only walk normal 4K page pagetables, they
> do not walk 4MB pages correctly. (which are set up on pentiums and better
> CPUs, unless mem=nopentium is specified.)
> 
> a kernel virtual address can be decoded by simply doing __pa(kaddr). If
> the page is a highmem page [and you have the struct page pointer] then you
> can do [(page-mem_map) << PAGE_SHIFT] to get the physical address, but
> only on systems where mem_map[] starts at physical address 0.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[patch] minor lockd debugging typo

2001-05-29 Thread Pete Wyckoff

Rather than print proc twice, you probably intend to show the prog.
(At least, I had wanted to see the prog in the debugging output. :) )

-- Pete

diff -X dontdiff -ruN linux-2.4.5-kdb/fs/lockd/mon.c linux-2.4.5/fs/lockd/mon.c
--- linux-2.4.5-kdb/fs/lockd/mon.c  Mon Oct 16 15:58:51 2000
+++ linux-2.4.5/fs/lockd/mon.c  Tue May 29 15:52:17 2001
@@ -146,7 +146,7 @@
u32 addr = ntohl(argp->addr);
 
dprintk("nsm: xdr_encode_mon(%08x, %d, %d, %d)\n",
-   htonl(argp->addr), htonl(argp->proc),
+   htonl(argp->addr), htonl(argp->prog),
htonl(argp->vers), htonl(argp->proc));
 
/*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: dget()

2001-05-19 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
> # cd /usr/src/linux
> # find -name '*.[ch]' | ctags -L- &
>
> On 15 May 2001, Xavier Bestel wrote:
> > # cd /usr/src/linux
> > # make tags
> 
> No, I never use that one because it skips very useful entries like the
> ones from EXPORT_SYMBOL etc. Also, it only shows the current architecture.
> So, the tags target in the Makefile would only become useful when it is
> stripped of extra (unnecessary, imho) logic and turned into a plain one I
> suggested above.

You can use the -I feature to ignore some keywords, but not the
functions they modify.  Without ignoring __initdata, for instance, you
get a couple hundred tags for it, and none for the actual variable,
e.g. cpu_vendor_names.  My current ignore list for the kernel is at
http://www.osc.edu/~pw/ctags-ignore .
If you really want to see the EXPORT_SYMBOL(tag) lines, you'll want
to remove that one from the list (and be sure to do the sorting...),

Next, the wonderful editor Vim doesn't read my mind quite well enough.
When I want to see the tag for "page", for example, I really
want to look at the definition of the struct in include/linux/mm.h, not
at any of the 16 other places which declare a variable struct page *page;.
So I run the tags file through a perl script which percolates those
interesting items to the top when there are multiple entries for the
same identifier.  It's here: http://www.osc.edu/~pw/sort-tags .

Finally, I modify the Makefile to generate tags with these
modifications. Here's the snippet: 

tags: dummy
( find include/asm-$(ARCH) -name "*.h" ;\
  find include -type d \( -name "asm-*" -o -name config \) -prune \
  -o -name "*.h" -print ; find $(SUBDIRS) init -name '*.[chS]' ) |\
ctags -I../ctags-ignore -L - -f - | sort-tags > tags

Note that this only generates tags for your current architecture,
and that your ctags must be Exuberant and version >= 5.0.

-- Pete
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC] Direct Sockets Support??

2001-05-08 Thread 'Pete Wyckoff'

[EMAIL PROTECTED] said:
>   > But in the case of an application which fits in main memory, and
>   > has been running for a while (so all pages are present and
>   > dirty), all you'd really have to do is verify the page tables are
>   > in the proper state and skip the TLB flush, right?
> 
>   We really cannot assume this. There are two cases 
>   a. when a user app wants to receive some data, it allocates
> memory(using malloc) and waits for the hw to do zero-copy read. The kernel
> does not allocate physical page frames for the entire memory region
> allocated. We need to lock the memory (and locking is expensive due to
> costly TLB flushes) to do this
> 
>   b. when a user app wants to send data, he fills the buffer
> and waits for the hw to transmit data, but under heavy physical memory
> pressure, the swapper might swap the pages we want to transmit. So we need
> to lock the memory to be 100% sure.

You're right, of course.  But I suspect that the fast path of
re-locking memory which is happily in core will go much faster
by removing the multi-processor TLB purge.  And it can't hurt,
unless I'm missing something.

-- Pete

--- linux-2.4.4-stock/mm/mlock.c    Tue May  8 17:26:34 2001
+++ linux/mm/mlock.c    Tue May  8 17:24:13 2001
@@ -114,6 +114,10 @@
return 0;
 }
 
+/* implemented in mm/memory.c */
+extern int mlock_make_pages_present(struct vm_area_struct *vma,
+   unsigned long addr, unsigned long end);
+
 static int mlock_fixup(struct vm_area_struct * vma, 
unsigned long start, unsigned long end, unsigned int newflags)
 {
@@ -138,7 +142,7 @@
pages = (end - start) >> PAGE_SHIFT;
if (newflags & VM_LOCKED) {
pages = -pages;
-   make_pages_present(start, end);
+   mlock_make_pages_present(vma, start, end);
}
vma->vm_mm->locked_vm -= pages;
}

--- linux-2.4.4-stock/mm/memory.c   Tue May  8 17:25:36 2001
+++ linux/mm/memory.c   Tue May  8 17:24:40 2001
@@ -1438,3 +1438,80 @@
} while (addr < end);
return 0;
 }
+
+/*
+ * Specialized version of make_pages_present which does not require
+ * a multi-processor TLB purge for every page if nothing about the PTE
+ * was modified.
+ */
+int mlock_make_pages_present(struct vm_area_struct *vma,
+   unsigned long addr, unsigned long end)
+{
+   int ret, write;
+   struct mm_struct *mm = current->mm;
+
+   write = (vma->vm_flags & VM_WRITE) != 0;
+
+   /*
+* We need the page table lock to synchronize with kswapd
+* and the SMP-safe atomic PTE updates.
+*/
+   spin_lock(&mm->page_table_lock);
+
+   ret = 0;
+   for (ret=0; !ret && addr < end; addr += PAGE_SIZE) {
+   pgd_t *pgd;
+   pmd_t *pmd;
+   pte_t *pte, entry;
+   int modified;
+
+   current->state = TASK_RUNNING;
+   pgd = pgd_offset(mm, addr);
+   pmd = pmd_alloc(mm, pgd, addr);
+   if (!pmd) {
+   ret = -1;
+   break;
+   }
+   pte = pte_alloc(mm, pmd, addr);
+   if (!pte) {
+   ret = -1;
+   break;
+   }
+   entry = *pte;
+   if (!pte_present(entry)) {
+   /*
+* If it truly wasn't present, we know that kswapd
+* and the PTE updates will not touch it later. So
+* drop the lock.
+*/
+   if (pte_none(entry)) {
+   ret = do_no_page(mm, vma, addr, write, pte);
+   continue;
+   }
+   ret = do_swap_page(mm, vma, addr, pte,
+   pte_to_swp_entry(entry), write);
+   continue;
+   }
+
+   modified = 0;
+   if (write) {
+   if (!pte_write(entry)) {
+   ret = do_wp_page(mm, vma, addr, pte, entry);
+   continue;
+   }
+   if (!pte_dirty(entry)) {
+   entry = pte_mkdirty(entry);
+   modified = 1;
+   }
+   }
+   if (!pte_young(entry)) {
+   entry = pte_mkyoung(entry);
+   modified = 1;
+   }
+   if (modified)
+   establish_pte(vma, addr, pte, entry);
+   }
+
+   spin_unlock(&mm->page_table_lock);
+   return ret;
+}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] Direct Sockets Support??

2001-05-08 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
> > A couple of concerns I have:
> >  * How to pin or pagelock the application buffer without
> > making a kernel transition.
> 
> You need to pin them in advance. And pinning pages is _expensive_ so you dont
> want to keep pinning/unpinning pages

I can't convince myself why this has to be so expensive.  The
current implementation does this for mlock:

1.  Split vma if only a subset of the pages are being locked.
2.  Mark bit in vma.
3.  Make sure the pages are in core.

That third step has the potential of being the most expensive,
as changing the page tables requires invalidating the TLBs on all
processors.  Currently make_pages_present() does the work for 3.

But in the case of an application which fits in main memory, and
has been running for a while (so all pages are present and
dirty), all you'd really have to do is verify the page tables are
in the proper state and skip the TLB flush, right?

Then 3 turns into a single spin_lock pair for the page_table_lock, 
and walking down the page table.

The VMA splitting can be nasty, as it might require a couple of
slab allocations, and doing an AVL insertion.  (More nastiness in
the case of shared memory or file mapping, too.)  But nothing
like playing with TLBs.

Any reason why make_pages_present() is not the really oversized
hammer it seems to be?

-- Pete
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Question on mmap(2) with kernel alocated memory

2001-05-05 Thread Pete Wyckoff

[EMAIL PROTECTED] said:
> I am trying to mmap() into user space a kernel buffer  and am having
> problems.
> I have a simple test example, can someone please tell me what I have got
> wrong ?
> 
> In a driver I do:
> uint*kva;
> 
> kva = (uint*)kmalloc(4096, GFP_KERNEL);
> *kva = 0x11223344;
> printk("Address: %p %lx %x\n", kva, virt_to_phys(kva), *kva);
> 
> Now in some simple user program I do:
> 
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <sys/mman.h>
> #include <fcntl.h>
> 
> int main(int argc, char** argv){
>  int fm;
>  char* p;
>  uint* pi;
>  uint v;
>  uint add = 0x74b000;
> 
>  if((fm = open("/dev/mem", O_RDWR)) < 0)
>   return 1;
> 
>  p = mmap(0, 128 * 1024 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fm,
> 0);
>  printf("Mapped: %p\n", p);
> 
>  lseek(fm, add, SEEK_SET);
>  read(fm, &v, sizeof(v));
>  printf("V: %x\n", v);
> 
>  pi = (uint*)(p + add);
>  printf("Vmmap: %p %x\n", pi, *pi);
> 
>  close(fm);
>  return 0;
> }
> 
> The value of add is hardcoded to the value printed for the physical
> address in the drivers prink routine.
> The lseek/read from the /dev/mem device yields the value 0x11223344.
> However the mmap method also on /dev/mem yields the value 0.
> 
> Whats wrong with my mmap() or kalloc() ?

Executive summary:

mmap of /dev/mem gives different values than lseek/read of /dev/mem

Using lseek/read gives back bits of physical memory, but dereferencing
a pointer into the mmaped area mostly gives zeroes.

The file-system specific mmap routine, mmap_mem, calls
remap_page_range to do the work of remapping the physical pages
into user space.  But eventually in remap_pte_range there's a check

if ((!VALID_PAGE(page)) || PageReserved(page))
set_pte(pte, mk_pte_phys(phys_addr, prot));

And the effect is that only the reserved pages and those outside
of the physical memory space get mapped.  This isn't intuitive for
/dev/mem, but is it the intended behavior?

You could replicate those 80 odd lines of remap_page_range and helpers
to get a version without the PageReserved test.  Yuk.

Quick hack:
SetPageReserved(virt_to_page(kva));

but you may want to undo that (Clear...) before kfree-ing the page,
as I'm not sure what will be confused by that.

Longer term solution:  write your own mmap routine in the driver,
have the user code open /dev/my-driver, and mmap() on that instead
of going through /dev/mem with a hardcoded address.
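
A rough 2.4-era sketch of that longer term approach (the mydrv_* names
are made up and error handling is trimmed; one reserved page, as in
the example above):

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <asm/io.h>

static void *kva;

static int mydrv_alloc_buf(void)
{
        kva = kmalloc(PAGE_SIZE, GFP_KERNEL);
        if (!kva)
                return -ENOMEM;
        SetPageReserved(virt_to_page(kva)); /* so remap_page_range() maps it */
        return 0;
}

static void mydrv_free_buf(void)
{
        ClearPageReserved(virt_to_page(kva));
        kfree(kva);
}

static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long size = vma->vm_end - vma->vm_start;

        if (size > PAGE_SIZE)
                return -EINVAL;
        if (remap_page_range(vma->vm_start, virt_to_phys(kva),
                             size, vma->vm_page_prot))
                return -EAGAIN;
        return 0;
}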

-- Pete
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Is jumbo ethernet MTU possible with Hamachi Gigabit ethernet driver?

2001-04-05 Thread Pete Wyckoff

The hardware also must support jumbo MTUs.  Hamachi limit is 1518 or
1522 (vlan) bytes, and the driver can't fix that.

-- Pete

[EMAIL PROTECTED] said:
> I'm trying to get some Gigabit ethernet cards that use the Packet
> Engines Hamachi GNIC-II chip to use a large mtu to attempt to get a
> throughput of close to the 1Gb rating of the card.  This is on a Compact
> PCI Alpha system.  I'm trying to use an MTU in the 8000 to 9000 range
> and so far have not been able to get these MTUs to work.
> 
> I have changed the PKT_BUF_SZ and MAX_FRAME_SIZE constants in hamachi.c
> and ETH_DATA_LEN and ETH_FRAME_LEN in if_ether.h.  I can use ifconfig to
> change the MTU above 1500 on one side of a connection but as soon as I
> raise the MTU on both sides to greater than 1500 the link dies.  I can
> change the MTU with ifconfig back to 1500 and the link will resume
> operation.  We are currently somewhat married to the 2.2.14 kernel.
> 
> I read that some ethernet drivers will support jumbo MTUs.  There
> appears to be something in the hamachi driver or the kernel that I've
> missed.  Perhaps this only works with a later version kernel or the
> hamachi driver needs more changes?  Any help would be appreciated.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



ns558 pci/isa change

2000-11-13 Thread Pete Wyckoff

My joystick stopped working at 2.4.0-test10 due to a patch to
drivers/char/joystick/ns558.c that moves pci probing ahead of
isa probing.

The problem is that pci_module_init can return -ENODEV into i,
which is then used to index the ISA portlist.  ISA probing assumes
that i is initialized to zero.

Trivial fix attached, with plenty of context.

-- Pete

--- drivers/char/joystick/ns558.c.orig  Mon Nov 13 18:04:16 2000
+++ drivers/char/joystick/ns558.c   Mon Nov 13 18:11:41 2000
@@ -299,37 +299,38 @@
 deactivate:
if (dev->deactivate)
dev->deactivate(dev);
return next;
 }
 #endif
 
 int __init ns558_init(void)
 {
-   int i = 0;
+   int i;
 #ifdef NSS558_ISAPNP
struct pci_dev *dev = NULL;
struct pnp_devid *devid;
 #endif
 
 /*
  * Probe for PCI ports.  Always probe for PCI first,
  * it is the least-invasive probe.
  */
 
i = pci_module_init(&ns558_pci_driver);
if (i == 0)
have_pci_devices = 1;
 
 /*
  * Probe for ISA ports.
  */
 
+   i = 0;
while (ns558_isa_portlist[i]) 
ns558 = ns558_isa_probe(ns558_isa_portlist[i++], ns558);
 
 /*
  * Probe for PnP ports.
  */
 
 #ifdef NSS558_ISAPNP
for (devid = pnp_devids; devid->vendor; devid++) {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/


