[PATCH 2/2] ib fmr pool: flush used clean entries
Commit a3cd7d9070be417a21905c997ee32d756d999b38 (IB/fmr_pool: ib_fmr_pool_flush() should flush all dirty FMRs) caused a regression for iSER and was reverted in e5507736c6449b3253347eed6f8ea77a28cf688e. This change attempts to redo the original patch so that all used FMR entries are flushed when ib_flush_fmr_pool() is called, and other FMR users are not affected. Simply move used entries from the clean list onto the dirty list before letting the cleanup thread do its job. Signed-off-by: Pete Wyckoff <[EMAIL PROTECTED]> --- drivers/infiniband/core/fmr_pool.c | 17 - 1 files changed, 16 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c index 4044fdf..06d502c 100644 --- a/drivers/infiniband/core/fmr_pool.c +++ b/drivers/infiniband/core/fmr_pool.c @@ -398,8 +398,23 @@ EXPORT_SYMBOL(ib_destroy_fmr_pool); */ int ib_flush_fmr_pool(struct ib_fmr_pool *pool) { - int serial = atomic_inc_return(>req_ser); + int serial; + struct ib_pool_fmr *fmr, *next; + + /* +* The free_list holds FMRs that may have been used +* but have not been remapped enough times to be dirty. +* Put them on the dirty list now so that the cleanup +* thread will reap them too. +*/ + spin_lock_irq(>pool_lock); + list_for_each_entry_safe(fmr, next, >free_list, list) { + if (fmr->remap_count > 0) + list_move(>list, >dirty_list); + } + spin_unlock_irq(>pool_lock); + serial = atomic_inc_return(>req_ser); wake_up_process(pool->thread); if (wait_event_interruptible(pool->force_wait, -- 1.5.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] Revert "IB/fmr_pool: ib_fmr_pool_flush() should flush all dirty FMRs"
This reverts commit a3cd7d9070be417a21905c997ee32d756d999b38. The original commit breaks iSER reliably, making it complain: iser: iser_reg_page_vec:ib_fmr_pool_map_phys failed: -11 The FMR cleanup thread runs ib_fmr_batch_release() as dirty entries build up. This commit causes clean but used FMR entries also to be purged. During that process, another thread can see that there are no free FMRs and fail, even though there should always have been enough available. Signed-off-by: Pete Wyckoff <[EMAIL PROTECTED]> --- drivers/infiniband/core/fmr_pool.c | 21 ++--- 1 files changed, 6 insertions(+), 15 deletions(-) diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c index 7f00347..4044fdf 100644 --- a/drivers/infiniband/core/fmr_pool.c +++ b/drivers/infiniband/core/fmr_pool.c @@ -139,7 +139,7 @@ static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool, static void ib_fmr_batch_release(struct ib_fmr_pool *pool) { int ret; - struct ib_pool_fmr *fmr, *next; + struct ib_pool_fmr *fmr; LIST_HEAD(unmap_list); LIST_HEAD(fmr_list); @@ -158,20 +158,6 @@ static void ib_fmr_batch_release(struct ib_fmr_pool *pool) #endif } - /* -* The free_list may hold FMRs that have been put there -* because they haven't reached the max_remap count. -* Invalidate their mapping as well. -*/ - list_for_each_entry_safe(fmr, next, >free_list, list) { - if (fmr->remap_count == 0) - continue; - hlist_del_init(>cache_node); - fmr->remap_count = 0; - list_add_tail(>fmr->list, _list); - list_move(>list, _list); - } - list_splice(>dirty_list, _list); INIT_LIST_HEAD(>dirty_list); pool->dirty_len = 0; @@ -384,6 +370,11 @@ void ib_destroy_fmr_pool(struct ib_fmr_pool *pool) i = 0; list_for_each_entry_safe(fmr, tmp, >free_list, list) { + if (fmr->remap_count) { + INIT_LIST_HEAD(_list); + list_add_tail(>fmr->list, _list); + ib_unmap_fmr(_list); + } ib_dealloc_fmr(fmr->fmr); list_del(>list); kfree(fmr); -- 1.5.4.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/2] (was Re: [ofa-general] fmr pool free_list empty)
[EMAIL PROTECTED] wrote on Mon, 25 Feb 2008 15:02 -0800: > Ugh. [pw wrote:] > > Looking at the FMR dirty list unmapping code in > > ib_fmr_batch_release(), there is a section that pulls all the dirty > > entries onto a list that it will later unmap and put back on the > > free list. > > > But it also plans to unmap all the free entries that have ever been > > remapped: > > Yes, this came from a3cd7d90 ("IB/fmr_pool: ib_fmr_pool_flush() should > flush all dirty FMRs"). That solved a real problem for Olaf, because > otherwise dirty FMRs with not at the max map count might never get > invalidated. It's not exactly an optimization but rather a > correctness issue, because RDS relies on killing mapping eventually. > > On the other hand, this behavior clearly does lead to the possibility > of leaving the free list temporarily empty for stupid reasons. > > I don't see a really good way to fix this at the momemnt, need to > meditate a little. Adding CCs in case some iser users are not on the openfabrics list. Original message is here: http://lists.openfabrics.org/pipermail/general/2008-February/047111.html This quoted commit is a regression for iSER. Not sure if it causes problems for the other FMR user, SRP. It went in after v2.6.24. Following this mail are two patches. One to revert the change, and one to attempt to do Olaf's patch in such a way that it does not cause problems for other FMR users. I haven't tested the patches with RDS. It apparently isn't in the tree yet. In fact, there are no users of ib_flush_fmr_pool() in the tree, which is the only function affected by the second patch. But iSER is working again in my scenario. As a side note, I don't remember seeing this patch on the openfabrics mailing list. Perhaps I missed it. Sometimes these sorts of interactions can be spotted if proposed changes get wider attention. -- Pete -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 0/2] (was Re: [ofa-general] fmr pool free_list empty)
[EMAIL PROTECTED] wrote on Mon, 25 Feb 2008 15:02 -0800: Ugh. [pw wrote:] Looking at the FMR dirty list unmapping code in ib_fmr_batch_release(), there is a section that pulls all the dirty entries onto a list that it will later unmap and put back on the free list. But it also plans to unmap all the free entries that have ever been remapped: Yes, this came from a3cd7d90 (IB/fmr_pool: ib_fmr_pool_flush() should flush all dirty FMRs). That solved a real problem for Olaf, because otherwise dirty FMRs with not at the max map count might never get invalidated. It's not exactly an optimization but rather a correctness issue, because RDS relies on killing mapping eventually. On the other hand, this behavior clearly does lead to the possibility of leaving the free list temporarily empty for stupid reasons. I don't see a really good way to fix this at the momemnt, need to meditate a little. Adding CCs in case some iser users are not on the openfabrics list. Original message is here: http://lists.openfabrics.org/pipermail/general/2008-February/047111.html This quoted commit is a regression for iSER. Not sure if it causes problems for the other FMR user, SRP. It went in after v2.6.24. Following this mail are two patches. One to revert the change, and one to attempt to do Olaf's patch in such a way that it does not cause problems for other FMR users. I haven't tested the patches with RDS. It apparently isn't in the tree yet. In fact, there are no users of ib_flush_fmr_pool() in the tree, which is the only function affected by the second patch. But iSER is working again in my scenario. As a side note, I don't remember seeing this patch on the openfabrics mailing list. Perhaps I missed it. Sometimes these sorts of interactions can be spotted if proposed changes get wider attention. -- Pete -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] Revert IB/fmr_pool: ib_fmr_pool_flush() should flush all dirty FMRs
This reverts commit a3cd7d9070be417a21905c997ee32d756d999b38. The original commit breaks iSER reliably, making it complain: iser: iser_reg_page_vec:ib_fmr_pool_map_phys failed: -11 The FMR cleanup thread runs ib_fmr_batch_release() as dirty entries build up. This commit causes clean but used FMR entries also to be purged. During that process, another thread can see that there are no free FMRs and fail, even though there should always have been enough available. Signed-off-by: Pete Wyckoff [EMAIL PROTECTED] --- drivers/infiniband/core/fmr_pool.c | 21 ++--- 1 files changed, 6 insertions(+), 15 deletions(-) diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c index 7f00347..4044fdf 100644 --- a/drivers/infiniband/core/fmr_pool.c +++ b/drivers/infiniband/core/fmr_pool.c @@ -139,7 +139,7 @@ static inline struct ib_pool_fmr *ib_fmr_cache_lookup(struct ib_fmr_pool *pool, static void ib_fmr_batch_release(struct ib_fmr_pool *pool) { int ret; - struct ib_pool_fmr *fmr, *next; + struct ib_pool_fmr *fmr; LIST_HEAD(unmap_list); LIST_HEAD(fmr_list); @@ -158,20 +158,6 @@ static void ib_fmr_batch_release(struct ib_fmr_pool *pool) #endif } - /* -* The free_list may hold FMRs that have been put there -* because they haven't reached the max_remap count. -* Invalidate their mapping as well. -*/ - list_for_each_entry_safe(fmr, next, pool-free_list, list) { - if (fmr-remap_count == 0) - continue; - hlist_del_init(fmr-cache_node); - fmr-remap_count = 0; - list_add_tail(fmr-fmr-list, fmr_list); - list_move(fmr-list, unmap_list); - } - list_splice(pool-dirty_list, unmap_list); INIT_LIST_HEAD(pool-dirty_list); pool-dirty_len = 0; @@ -384,6 +370,11 @@ void ib_destroy_fmr_pool(struct ib_fmr_pool *pool) i = 0; list_for_each_entry_safe(fmr, tmp, pool-free_list, list) { + if (fmr-remap_count) { + INIT_LIST_HEAD(fmr_list); + list_add_tail(fmr-fmr-list, fmr_list); + ib_unmap_fmr(fmr_list); + } ib_dealloc_fmr(fmr-fmr); list_del(fmr-list); kfree(fmr); -- 1.5.4.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] ib fmr pool: flush used clean entries
Commit a3cd7d9070be417a21905c997ee32d756d999b38 (IB/fmr_pool: ib_fmr_pool_flush() should flush all dirty FMRs) caused a regression for iSER and was reverted in e5507736c6449b3253347eed6f8ea77a28cf688e. This change attempts to redo the original patch so that all used FMR entries are flushed when ib_flush_fmr_pool() is called, and other FMR users are not affected. Simply move used entries from the clean list onto the dirty list before letting the cleanup thread do its job. Signed-off-by: Pete Wyckoff [EMAIL PROTECTED] --- drivers/infiniband/core/fmr_pool.c | 17 - 1 files changed, 16 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/core/fmr_pool.c b/drivers/infiniband/core/fmr_pool.c index 4044fdf..06d502c 100644 --- a/drivers/infiniband/core/fmr_pool.c +++ b/drivers/infiniband/core/fmr_pool.c @@ -398,8 +398,23 @@ EXPORT_SYMBOL(ib_destroy_fmr_pool); */ int ib_flush_fmr_pool(struct ib_fmr_pool *pool) { - int serial = atomic_inc_return(pool-req_ser); + int serial; + struct ib_pool_fmr *fmr, *next; + + /* +* The free_list holds FMRs that may have been used +* but have not been remapped enough times to be dirty. +* Put them on the dirty list now so that the cleanup +* thread will reap them too. +*/ + spin_lock_irq(pool-pool_lock); + list_for_each_entry_safe(fmr, next, pool-free_list, list) { + if (fmr-remap_count 0) + list_move(fmr-list, pool-dirty_list); + } + spin_unlock_irq(pool-pool_lock); + serial = atomic_inc_return(pool-req_ser); wake_up_process(pool-thread); if (wait_event_interruptible(pool-force_wait, -- 1.5.4.1 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: Demand paging for memory regions
[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 20:09 -0800: > One other area that has not been brought up yet (I think) is the > applicability of notifiers in letting users know when pinned memory > is reclaimed by the kernel. This is useful when a lower-level > library employs lazy deregistration strategies on memory regions that > are subsequently released to the kernel via the application's use of > munmap or sbrk. Ohio Supercomputing Center has work in this area but > a generalized approach in the kernel would certainly be welcome. The whole need for memory registration is a giant pain. There is no motivating application need for it---it is simply a hack around virtual memory and the lack of full VM support in current hardware. There are real hardware issues that interact poorly with virtual memory, as discussed previously in this thread. The way a messaging cycle goes in IB is: register buf post send from buf wait for completion deregister buf This tends to get hidden via userspace software libraries into a single call: MPI_send(buf) Now if you actually do the reg/dereg every time, things are very slow. So userspace library writers came up with the idea of caching registrations: if buf is not registered: register buf post send from buf wait for completion The second time that the app happens to do a send from the same buffer, it proceeds much faster. Spatial locality applies here, and this caching is generally worth it. Some libraries have schemes to limit the size of the registration cache too. But there are plenty of ways to hurt yourself with such a scheme. The first being a huge pool of unused but registered memory, as the library doesn't know the app patterns, and it doesn't know the VM pressure level in the kernel. There are plenty of subtle ways that this breaks too. If the registered buf is removed from the address space via munmap() or sbrk() or other ways, the mapping and registration are gone, but the library has no way of knowing that the app just did this. Sure the physical page is still there and pinned, but the app cannot get at it. Later if new address space arrives at the same virtual address but a different physical page, the library will mistakenly think it already has it registered properly, and data is transferred from this old now-unmapped physical page. The whole situation is rather ridiculuous, but we are quite stuck with it for current generation IB and iWarp hardware. If we can't have the kernel interact with the device directly, we could at least manage state in these multiple userspace registration caches. The VM could ask for certain (or any) pages to be released, and the library would respond if they are indeed not in use by the device. The app itself does not know about pinned regions, and the library is aware of exactly which regions are potentially in use. Since the great majority of userspace messaging over IB goes through middleware like MPI or PGAS languages, and they all have the same approach to registration caching, this approach could fix the problem for a big segment of use cases. More text on the registration caching problem is here: http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf with an approach using vm_ops open and close operations in a kernel module here: http://www.osc.edu/~pw/dreg/ There is a place for VM notifiers in RDMA messaging, but not in talking to devices, at least not the current set. If you can define a reasonable userspace interface for VM notifiers, libraries can manage registration caches more efficiently, letting the kernel unmap pinned pages as it likes. -- Pete -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: Demand paging for memory regions
[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 20:09 -0800: One other area that has not been brought up yet (I think) is the applicability of notifiers in letting users know when pinned memory is reclaimed by the kernel. This is useful when a lower-level library employs lazy deregistration strategies on memory regions that are subsequently released to the kernel via the application's use of munmap or sbrk. Ohio Supercomputing Center has work in this area but a generalized approach in the kernel would certainly be welcome. The whole need for memory registration is a giant pain. There is no motivating application need for it---it is simply a hack around virtual memory and the lack of full VM support in current hardware. There are real hardware issues that interact poorly with virtual memory, as discussed previously in this thread. The way a messaging cycle goes in IB is: register buf post send from buf wait for completion deregister buf This tends to get hidden via userspace software libraries into a single call: MPI_send(buf) Now if you actually do the reg/dereg every time, things are very slow. So userspace library writers came up with the idea of caching registrations: if buf is not registered: register buf post send from buf wait for completion The second time that the app happens to do a send from the same buffer, it proceeds much faster. Spatial locality applies here, and this caching is generally worth it. Some libraries have schemes to limit the size of the registration cache too. But there are plenty of ways to hurt yourself with such a scheme. The first being a huge pool of unused but registered memory, as the library doesn't know the app patterns, and it doesn't know the VM pressure level in the kernel. There are plenty of subtle ways that this breaks too. If the registered buf is removed from the address space via munmap() or sbrk() or other ways, the mapping and registration are gone, but the library has no way of knowing that the app just did this. Sure the physical page is still there and pinned, but the app cannot get at it. Later if new address space arrives at the same virtual address but a different physical page, the library will mistakenly think it already has it registered properly, and data is transferred from this old now-unmapped physical page. The whole situation is rather ridiculuous, but we are quite stuck with it for current generation IB and iWarp hardware. If we can't have the kernel interact with the device directly, we could at least manage state in these multiple userspace registration caches. The VM could ask for certain (or any) pages to be released, and the library would respond if they are indeed not in use by the device. The app itself does not know about pinned regions, and the library is aware of exactly which regions are potentially in use. Since the great majority of userspace messaging over IB goes through middleware like MPI or PGAS languages, and they all have the same approach to registration caching, this approach could fix the problem for a big segment of use cases. More text on the registration caching problem is here: http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf with an approach using vm_ops open and close operations in a kernel module here: http://www.osc.edu/~pw/dreg/ There is a place for VM notifiers in RDMA messaging, but not in talking to devices, at least not the current set. If you can define a reasonable userspace interface for VM notifiers, libraries can manage registration caches more efficiently, letting the kernel unmap pinned pages as it likes. -- Pete -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
[EMAIL PROTECTED] wrote on Wed, 30 Jan 2008 10:34 -0600: > On Wed, 2008-01-30 at 09:38 +0100, Bart Van Assche wrote: > > On Jan 30, 2008 12:32 AM, FUJITA Tomonori <[EMAIL PROTECTED]> wrote: > > > > > > iSER has parameters to limit the maximum size of RDMA (it needs to > > > repeat RDMA with a poor configuration)? > > > > Please specify which parameters you are referring to. As you know I > > had already repeated my tests with ridiculously high values for the > > following iSER parameters: FirstBurstLength, MaxBurstLength and > > MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block > > size specified to dd). > > the 1Mb block size is a bit of a red herring. Unless you've > specifically increased the max_sector_size and are using an sg_chain > converted driver, on x86 the maximum possible transfer accumulation is > 0.5MB. > > I certainly don't rule out that increasing the transfer size up from > 0.5MB might be the way to improve STGT efficiency, since at an 1GB/s > theoretical peak, that's roughly 2000 context switches per I/O; however, > It doesn't look like you've done anything that will overcome the block > layer limitations. The MRDSL parameter has no effect on iSER, as the RFC describes. How to transfer data to satisfy a command is solely up to the target. So you would need both big requests from the client, then look at how the target will send the data. I've only used 512 kB for the RDMA transfer size from the target, as it matches the default client size and was enough to get good performance out of my IB gear and minimizes resource consumption on the target. It's currently hard-coded as a #define. There is no provision in the protocol for the client to dictate the value. If others want to spend some effort trying to tune stgt for iSER, there are a fair number of comments in the code, including a big one that explains this RDMA transfer size issue. And I'll answer informed questions as I can. But I'm not particularly interested in arguing about which implementation is best, or trying to interpret bandwidth comparison numbers from poorly designed tests. It takes work to understand these issues. -- Pete -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Integration of SCST in the mainstream Linux kernel
[EMAIL PROTECTED] wrote on Wed, 30 Jan 2008 10:34 -0600: On Wed, 2008-01-30 at 09:38 +0100, Bart Van Assche wrote: On Jan 30, 2008 12:32 AM, FUJITA Tomonori [EMAIL PROTECTED] wrote: iSER has parameters to limit the maximum size of RDMA (it needs to repeat RDMA with a poor configuration)? Please specify which parameters you are referring to. As you know I had already repeated my tests with ridiculously high values for the following iSER parameters: FirstBurstLength, MaxBurstLength and MaxRecvDataSegmentLength (16 MB, which is more than the 1 MB block size specified to dd). the 1Mb block size is a bit of a red herring. Unless you've specifically increased the max_sector_size and are using an sg_chain converted driver, on x86 the maximum possible transfer accumulation is 0.5MB. I certainly don't rule out that increasing the transfer size up from 0.5MB might be the way to improve STGT efficiency, since at an 1GB/s theoretical peak, that's roughly 2000 context switches per I/O; however, It doesn't look like you've done anything that will overcome the block layer limitations. The MRDSL parameter has no effect on iSER, as the RFC describes. How to transfer data to satisfy a command is solely up to the target. So you would need both big requests from the client, then look at how the target will send the data. I've only used 512 kB for the RDMA transfer size from the target, as it matches the default client size and was enough to get good performance out of my IB gear and minimizes resource consumption on the target. It's currently hard-coded as a #define. There is no provision in the protocol for the client to dictate the value. If others want to spend some effort trying to tune stgt for iSER, there are a fair number of comments in the code, including a big one that explains this RDMA transfer size issue. And I'll answer informed questions as I can. But I'm not particularly interested in arguing about which implementation is best, or trying to interpret bandwidth comparison numbers from poorly designed tests. It takes work to understand these issues. -- Pete -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nosmp/maxcpus=0 or 1 -> TSC unstable
[EMAIL PROTECTED] wrote on Wed, 16 Jan 2008 16:31 +0100: > > * Pete Wyckoff <[EMAIL PROTECTED]> wrote: > > > > pretty sure this is the culprit is that num_possible_cpus() > 1, > > > which would mean cpu_possible_map contains the second cpu... but i'm > > > not quite sure what the right fix is... or perhaps this is all > > > intended. > > > > We've seen the same problem. We use gettimeofday() for timing of > > network-ish operations on the order of 10-50 us. But not having the > > TSC makes gettimeofday() itself very slow, on the order of 30 us. > > 30 usecs is too much - even with pmtimer it's typically below 5 usecs. > Could you run this on your box: > > http://people.redhat.com/mingo/time-warp-test/time-warp-test.c > > and send back what it reports? (run it for a few minutes) You're right. That 30 us comes from an old comment. Testing with your code shows under 5 us as you expected. I had to hack out the #ifndef i386 error; compiling on x86-64. Dual-socket 2.4 GHz Opteron 250. TSC-warps grows continually in all situations. On 2.6.24-rc6 + scsi-misc + random stuff, 2 processors: | 0.39 us, TSC-warps:983002775 | 4.81 us, TOD-warps:0 | 4.81 us, CLOCK-warps:0 With "maxcpus=1", no broken patch to force TSC: | 0.33 us, TSC-warps:679679972 | 3.30 us, TOD-warps:0 | 3.30 us, CLOCK-warps:0 With "maxcpus=1", including my broken patch to force use of TSC in this situation: | 0.05 us, TSC-warps:2884019968 | 0.45 us, TOD-warps:0 | 0.45 us, CLOCK-warps:0 For giggles, an older fedora kernel (2.6.23.1-42.fc8) gives: | 0.87 us, TSC-warps:575054334 | 8.67 us, TOD-warps:0 | 8.67 us, CLOCK-warps:0 -- Pete -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nosmp/maxcpus=0 or 1 - TSC unstable
[EMAIL PROTECTED] wrote on Wed, 16 Jan 2008 16:31 +0100: * Pete Wyckoff [EMAIL PROTECTED] wrote: pretty sure this is the culprit is that num_possible_cpus() 1, which would mean cpu_possible_map contains the second cpu... but i'm not quite sure what the right fix is... or perhaps this is all intended. We've seen the same problem. We use gettimeofday() for timing of network-ish operations on the order of 10-50 us. But not having the TSC makes gettimeofday() itself very slow, on the order of 30 us. 30 usecs is too much - even with pmtimer it's typically below 5 usecs. Could you run this on your box: http://people.redhat.com/mingo/time-warp-test/time-warp-test.c and send back what it reports? (run it for a few minutes) You're right. That 30 us comes from an old comment. Testing with your code shows under 5 us as you expected. I had to hack out the #ifndef i386 error; compiling on x86-64. Dual-socket 2.4 GHz Opteron 250. TSC-warps grows continually in all situations. On 2.6.24-rc6 + scsi-misc + random stuff, 2 processors: | 0.39 us, TSC-warps:983002775 | 4.81 us, TOD-warps:0 | 4.81 us, CLOCK-warps:0 With maxcpus=1, no broken patch to force TSC: | 0.33 us, TSC-warps:679679972 | 3.30 us, TOD-warps:0 | 3.30 us, CLOCK-warps:0 With maxcpus=1, including my broken patch to force use of TSC in this situation: | 0.05 us, TSC-warps:2884019968 | 0.45 us, TOD-warps:0 | 0.45 us, CLOCK-warps:0 For giggles, an older fedora kernel (2.6.23.1-42.fc8) gives: | 0.87 us, TSC-warps:575054334 | 8.67 us, TOD-warps:0 | 8.67 us, CLOCK-warps:0 -- Pete -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nosmp/maxcpus=0 or 1 -> TSC unstable
[EMAIL PROTECTED] wrote on Sat, 12 Jan 2008 11:48 -0800: > if i boot an x86 64-bit 2.6.24-rc7 kernel with nosmp, maxcpus=0 or 1 it > still disables TSC :) > > Marking TSC unstable due to TSCs unsynchronized > > this is an opteron 2xx box which does have two cpus and no clock-divide in > halt or cpufreq enabled so TSC should be fine with only one cpu. > > pretty sure this is the culprit is that num_possible_cpus() > 1, which > would mean cpu_possible_map contains the second cpu... but i'm not quite > sure what the right fix is... or perhaps this is all intended. We've seen the same problem. We use gettimeofday() for timing of network-ish operations on the order of 10-50 us. But not having the TSC makes gettimeofday() itself very slow, on the order of 30 us. Here's what we've been using for quite a few kernel versions. I've not tried to submit it for fear that it could break some other scenario, as you suggest. Although in hotplug scenarios, this function unsynchronized_tsc() should get rerun and disable TSC if more processors arrive. At least count this as a "me too". -- Pete >From 0cdcd494bc0e27f49438bc2fc72fd3823629802b Mon Sep 17 00:00:00 2001 From: Pete Wyckoff <[EMAIL PROTECTED]> Date: Tue, 15 Jan 2008 17:42:28 -0500 Subject: [PATCH] use tsc on 1 cpu smp Use num_online_cpus() instead of num_present_cpus() as the parameter to check when deciding if TSC is good enough. Thus explicitly booting with maxcpus=1 will let us use the TSC even on a dual-processor machine. This helps reduce gettimeofday overheads on our dual Opteron nodes immensely (30 us vs 0.5 us). Signed-off-by: Pete Wyckoff <[EMAIL PROTECTED]> --- arch/x86/kernel/tsc_64.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/tsc_64.c b/arch/x86/kernel/tsc_64.c index 9c70af4..5f2e91f 100644 --- a/arch/x86/kernel/tsc_64.c +++ b/arch/x86/kernel/tsc_64.c @@ -235,7 +235,7 @@ __cpuinit int unsynchronized_tsc(void) } /* Assume multi socket systems are not synchronized */ - return num_present_cpus() > 1; + return num_online_cpus() > 1; } int __init notsc_setup(char *s) -- 1.5.3.7 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: nosmp/maxcpus=0 or 1 - TSC unstable
[EMAIL PROTECTED] wrote on Sat, 12 Jan 2008 11:48 -0800: if i boot an x86 64-bit 2.6.24-rc7 kernel with nosmp, maxcpus=0 or 1 it still disables TSC :) Marking TSC unstable due to TSCs unsynchronized this is an opteron 2xx box which does have two cpus and no clock-divide in halt or cpufreq enabled so TSC should be fine with only one cpu. pretty sure this is the culprit is that num_possible_cpus() 1, which would mean cpu_possible_map contains the second cpu... but i'm not quite sure what the right fix is... or perhaps this is all intended. We've seen the same problem. We use gettimeofday() for timing of network-ish operations on the order of 10-50 us. But not having the TSC makes gettimeofday() itself very slow, on the order of 30 us. Here's what we've been using for quite a few kernel versions. I've not tried to submit it for fear that it could break some other scenario, as you suggest. Although in hotplug scenarios, this function unsynchronized_tsc() should get rerun and disable TSC if more processors arrive. At least count this as a me too. -- Pete From 0cdcd494bc0e27f49438bc2fc72fd3823629802b Mon Sep 17 00:00:00 2001 From: Pete Wyckoff [EMAIL PROTECTED] Date: Tue, 15 Jan 2008 17:42:28 -0500 Subject: [PATCH] use tsc on 1 cpu smp Use num_online_cpus() instead of num_present_cpus() as the parameter to check when deciding if TSC is good enough. Thus explicitly booting with maxcpus=1 will let us use the TSC even on a dual-processor machine. This helps reduce gettimeofday overheads on our dual Opteron nodes immensely (30 us vs 0.5 us). Signed-off-by: Pete Wyckoff [EMAIL PROTECTED] --- arch/x86/kernel/tsc_64.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/tsc_64.c b/arch/x86/kernel/tsc_64.c index 9c70af4..5f2e91f 100644 --- a/arch/x86/kernel/tsc_64.c +++ b/arch/x86/kernel/tsc_64.c @@ -235,7 +235,7 @@ __cpuinit int unsynchronized_tsc(void) } /* Assume multi socket systems are not synchronized */ - return num_present_cpus() 1; + return num_online_cpus() 1; } int __init notsc_setup(char *s) -- 1.5.3.7 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Fri, 11 Jan 2008 19:16 -0500: > James Bottomley wrote: >> On Thu, 2008-01-10 at 16:46 -0500, Pete Wyckoff wrote: >>> [EMAIL PROTECTED] wrote on Thu, 10 Jan 2008 14:55 -0600: >>>> On Thu, 2008-01-10 at 15:43 -0500, Pete Wyckoff wrote: >>>>> [EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900: >>>>>> On Tue, 8 Jan 2008 17:09:18 -0500 >>>>>> Pete Wyckoff <[EMAIL PROTECTED]> wrote: >>>>>>> I took another look at the compat approach, to see if it is feasible >>>>>>> to keep the compat handling somewhere else, without the use of #ifdef >>>>>>> CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. >>>>>>> The use of iovec is within a write operation on a char device. It's >>>>>>> not amenable to a compat_sys_ or a .compat_ioctl approach. >>>>>>> >>>>>>> I'm partial to #1 because the use of architecture-independent fields >>>>>>> matches the rest of struct sg_io_v4. But if you don't want to have >>>>>>> another iovec type in the kernel, could we do #2 but just return >>>>>>> -EINVAL if the need for compat is detected? I.e. change >>>>>>> dout_iovec_count to dout_iovec_length and do the math? >>>>>> If you are ok with removing the write/read interface and just have >>>>>> ioctl, we could can handle comapt stuff like others do. But I think >>>>>> that you (OSD people) really want to keep the write/read >>>>>> interface. Sorry, I think that there is no workaround to support iovec >>>>>> in bsg. >>>>> I don't care about read/write in particular. But we do need some >>>>> way to launch asynchronous SCSI commands, and currently read/write >>>>> are the only way to do that in bsg. The reason is to keep multiple >>>>> spindles busy at the same time. >>>> Won't multi-threading the ioctl calls achieve the same effect? Or do >>>> you trip over BKL there? >>> There's no BKL on (new) ioctls anymore, at least. A thread per >>> device would be feasible perhaps. But if you want any sort of >>> pipelining out of the device, esp. in the remote iSCSI case, you >>> need to have a good number of commands outstanding to each device. >>> So a thread per command per device. Typical iSCSI queue depth of >>> 128 times 16 devices for a small setup is a lot of threads. >> >> I was actually thinking of a thread per outstanding command. >> >>> The pthread/pipe latency overhead is not insignificant for fast >>> storage networks too. >>> >>>>> How about these new ioctls instead of read/write: >>>>> >>>>> SG_IO_SUBMIT - start a new blk_execute_rq_nowait() >>>>> SG_IO_TEST - complete and return a previous req >>>>> SG_IO_WAIT - wait for a req to finish, interruptibly >>>>> >>>>> Then old write users will instead do ioctl SUBMIT. Read users will >>>>> do TEST for non-blocking fd, or WAIT for blocking. And SG_IO could >>>>> be implemented as SUBMIT + WAIT. >>>>> >>>>> Then we can do compat_ioctl and convert up iovecs out-of-line before >>>>> calling the normal functions. >>>>> >>>>> Let me know if you want a patch for this. >>>> Really, the thought of re-inventing yet another async I/O interface >>>> isn't very appealing. >>> I'm fine with read/write, except Tomo is against handling iovecs >>> because of the compat complexity with struct iovec being different >>> on 32- vs 64-bit. There is a standard way to do "compat" ioctl that >>> hides this handling in a different file (not bsg.c), which is the >>> only reason I'm even considering these ioctls. I don't care about >>> compat setups per se. >>> >>> Is there another async I/O mechanism? Userspace builds the CDBs, >>> just needs some way to drop them in SCSI ML. BSG is almost perfect >>> for this, but doesn't do iovec, leading to lots of memcpy. >> >> No, it's just that async interfaces in Linux have a long and fairly >> unhappy history. > > The sg driver's async interface has been pretty stable for > a long time. The sync SG_IO ioctl is built on top of the > async interface. That makes the async interface extremely > well tested. > > The write()/read() async interface in s
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Fri, 11 Jan 2008 19:16 -0500: James Bottomley wrote: On Thu, 2008-01-10 at 16:46 -0500, Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Thu, 10 Jan 2008 14:55 -0600: On Thu, 2008-01-10 at 15:43 -0500, Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900: On Tue, 8 Jan 2008 17:09:18 -0500 Pete Wyckoff [EMAIL PROTECTED] wrote: I took another look at the compat approach, to see if it is feasible to keep the compat handling somewhere else, without the use of #ifdef CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. The use of iovec is within a write operation on a char device. It's not amenable to a compat_sys_ or a .compat_ioctl approach. I'm partial to #1 because the use of architecture-independent fields matches the rest of struct sg_io_v4. But if you don't want to have another iovec type in the kernel, could we do #2 but just return -EINVAL if the need for compat is detected? I.e. change dout_iovec_count to dout_iovec_length and do the math? If you are ok with removing the write/read interface and just have ioctl, we could can handle comapt stuff like others do. But I think that you (OSD people) really want to keep the write/read interface. Sorry, I think that there is no workaround to support iovec in bsg. I don't care about read/write in particular. But we do need some way to launch asynchronous SCSI commands, and currently read/write are the only way to do that in bsg. The reason is to keep multiple spindles busy at the same time. Won't multi-threading the ioctl calls achieve the same effect? Or do you trip over BKL there? There's no BKL on (new) ioctls anymore, at least. A thread per device would be feasible perhaps. But if you want any sort of pipelining out of the device, esp. in the remote iSCSI case, you need to have a good number of commands outstanding to each device. So a thread per command per device. Typical iSCSI queue depth of 128 times 16 devices for a small setup is a lot of threads. I was actually thinking of a thread per outstanding command. The pthread/pipe latency overhead is not insignificant for fast storage networks too. How about these new ioctls instead of read/write: SG_IO_SUBMIT - start a new blk_execute_rq_nowait() SG_IO_TEST - complete and return a previous req SG_IO_WAIT - wait for a req to finish, interruptibly Then old write users will instead do ioctl SUBMIT. Read users will do TEST for non-blocking fd, or WAIT for blocking. And SG_IO could be implemented as SUBMIT + WAIT. Then we can do compat_ioctl and convert up iovecs out-of-line before calling the normal functions. Let me know if you want a patch for this. Really, the thought of re-inventing yet another async I/O interface isn't very appealing. I'm fine with read/write, except Tomo is against handling iovecs because of the compat complexity with struct iovec being different on 32- vs 64-bit. There is a standard way to do compat ioctl that hides this handling in a different file (not bsg.c), which is the only reason I'm even considering these ioctls. I don't care about compat setups per se. Is there another async I/O mechanism? Userspace builds the CDBs, just needs some way to drop them in SCSI ML. BSG is almost perfect for this, but doesn't do iovec, leading to lots of memcpy. No, it's just that async interfaces in Linux have a long and fairly unhappy history. The sg driver's async interface has been pretty stable for a long time. The sync SG_IO ioctl is built on top of the async interface. That makes the async interface extremely well tested. The write()/read() async interface in sg does have one problem: when a command is dispatched via a write() it would be very useful to get back a tag but that violates write()'s second argument: 'const void * buf'. That tag could be useful both for identification of the response and by task management functions. I was hoping that the 'flags' field in sgv4 could be used to implement the variants: SG_IO_SUBMIT - start a new blk_execute_rq_nowait() SG_IO_TEST - complete and return a previous req SG_IO_WAIT - wait for a req to finish, interruptibly that way the existing SG_IO ioctl is sufficient. And if Tomo doesn't want to do it in the bsg driver, then it could be done it the sg driver. The sg driver already has async via read/write, and it works fine. Perhaps someone wants the ioctl versions too, but that's not my main goal here. I think that sg doesn't bother with compat iovec handling. It just uses SZ_SG_IOVEC (defined as sizeof(sg_iovec_t)) and doesn't check if the userspace iovec happens to be smaller. Sg does have a compat ioctl function; it just doesn't support SG_IO. So, on 64-bit kernel, read/write from 32-bit userspace with iovec will get undefined results due to the mis-interpretation of the iovec fields, while ioctl from 32-bit will fail with ENOIOCTLCMD
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Thu, 10 Jan 2008 14:55 -0600: > On Thu, 2008-01-10 at 15:43 -0500, Pete Wyckoff wrote: > > [EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900: > > > On Tue, 8 Jan 2008 17:09:18 -0500 > > > Pete Wyckoff <[EMAIL PROTECTED]> wrote: > > > > I took another look at the compat approach, to see if it is feasible > > > > to keep the compat handling somewhere else, without the use of #ifdef > > > > CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. > > > > The use of iovec is within a write operation on a char device. It's > > > > not amenable to a compat_sys_ or a .compat_ioctl approach. > > > > > > > > I'm partial to #1 because the use of architecture-independent fields > > > > matches the rest of struct sg_io_v4. But if you don't want to have > > > > another iovec type in the kernel, could we do #2 but just return > > > > -EINVAL if the need for compat is detected? I.e. change > > > > dout_iovec_count to dout_iovec_length and do the math? > > > > > > If you are ok with removing the write/read interface and just have > > > ioctl, we could can handle comapt stuff like others do. But I think > > > that you (OSD people) really want to keep the write/read > > > interface. Sorry, I think that there is no workaround to support iovec > > > in bsg. > > > > I don't care about read/write in particular. But we do need some > > way to launch asynchronous SCSI commands, and currently read/write > > are the only way to do that in bsg. The reason is to keep multiple > > spindles busy at the same time. > > Won't multi-threading the ioctl calls achieve the same effect? Or do > you trip over BKL there? There's no BKL on (new) ioctls anymore, at least. A thread per device would be feasible perhaps. But if you want any sort of pipelining out of the device, esp. in the remote iSCSI case, you need to have a good number of commands outstanding to each device. So a thread per command per device. Typical iSCSI queue depth of 128 times 16 devices for a small setup is a lot of threads. The pthread/pipe latency overhead is not insignificant for fast storage networks too. > > How about these new ioctls instead of read/write: > > > > SG_IO_SUBMIT - start a new blk_execute_rq_nowait() > > SG_IO_TEST - complete and return a previous req > > SG_IO_WAIT - wait for a req to finish, interruptibly > > > > Then old write users will instead do ioctl SUBMIT. Read users will > > do TEST for non-blocking fd, or WAIT for blocking. And SG_IO could > > be implemented as SUBMIT + WAIT. > > > > Then we can do compat_ioctl and convert up iovecs out-of-line before > > calling the normal functions. > > > > Let me know if you want a patch for this. > > Really, the thought of re-inventing yet another async I/O interface > isn't very appealing. I'm fine with read/write, except Tomo is against handling iovecs because of the compat complexity with struct iovec being different on 32- vs 64-bit. There is a standard way to do "compat" ioctl that hides this handling in a different file (not bsg.c), which is the only reason I'm even considering these ioctls. I don't care about compat setups per se. Is there another async I/O mechanism? Userspace builds the CDBs, just needs some way to drop them in SCSI ML. BSG is almost perfect for this, but doesn't do iovec, leading to lots of memcpy. -- Pete -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900: > On Tue, 8 Jan 2008 17:09:18 -0500 > Pete Wyckoff <[EMAIL PROTECTED]> wrote: > > I took another look at the compat approach, to see if it is feasible > > to keep the compat handling somewhere else, without the use of #ifdef > > CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. > > The use of iovec is within a write operation on a char device. It's > > not amenable to a compat_sys_ or a .compat_ioctl approach. > > > > I'm partial to #1 because the use of architecture-independent fields > > matches the rest of struct sg_io_v4. But if you don't want to have > > another iovec type in the kernel, could we do #2 but just return > > -EINVAL if the need for compat is detected? I.e. change > > dout_iovec_count to dout_iovec_length and do the math? > > If you are ok with removing the write/read interface and just have > ioctl, we could can handle comapt stuff like others do. But I think > that you (OSD people) really want to keep the write/read > interface. Sorry, I think that there is no workaround to support iovec > in bsg. I don't care about read/write in particular. But we do need some way to launch asynchronous SCSI commands, and currently read/write are the only way to do that in bsg. The reason is to keep multiple spindles busy at the same time. How about these new ioctls instead of read/write: SG_IO_SUBMIT - start a new blk_execute_rq_nowait() SG_IO_TEST - complete and return a previous req SG_IO_WAIT - wait for a req to finish, interruptibly Then old write users will instead do ioctl SUBMIT. Read users will do TEST for non-blocking fd, or WAIT for blocking. And SG_IO could be implemented as SUBMIT + WAIT. Then we can do compat_ioctl and convert up iovecs out-of-line before calling the normal functions. Let me know if you want a patch for this. -- Pete -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900: On Tue, 8 Jan 2008 17:09:18 -0500 Pete Wyckoff [EMAIL PROTECTED] wrote: I took another look at the compat approach, to see if it is feasible to keep the compat handling somewhere else, without the use of #ifdef CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. The use of iovec is within a write operation on a char device. It's not amenable to a compat_sys_ or a .compat_ioctl approach. I'm partial to #1 because the use of architecture-independent fields matches the rest of struct sg_io_v4. But if you don't want to have another iovec type in the kernel, could we do #2 but just return -EINVAL if the need for compat is detected? I.e. change dout_iovec_count to dout_iovec_length and do the math? If you are ok with removing the write/read interface and just have ioctl, we could can handle comapt stuff like others do. But I think that you (OSD people) really want to keep the write/read interface. Sorry, I think that there is no workaround to support iovec in bsg. I don't care about read/write in particular. But we do need some way to launch asynchronous SCSI commands, and currently read/write are the only way to do that in bsg. The reason is to keep multiple spindles busy at the same time. How about these new ioctls instead of read/write: SG_IO_SUBMIT - start a new blk_execute_rq_nowait() SG_IO_TEST - complete and return a previous req SG_IO_WAIT - wait for a req to finish, interruptibly Then old write users will instead do ioctl SUBMIT. Read users will do TEST for non-blocking fd, or WAIT for blocking. And SG_IO could be implemented as SUBMIT + WAIT. Then we can do compat_ioctl and convert up iovecs out-of-line before calling the normal functions. Let me know if you want a patch for this. -- Pete -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Thu, 10 Jan 2008 14:55 -0600: On Thu, 2008-01-10 at 15:43 -0500, Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Wed, 09 Jan 2008 09:11 +0900: On Tue, 8 Jan 2008 17:09:18 -0500 Pete Wyckoff [EMAIL PROTECTED] wrote: I took another look at the compat approach, to see if it is feasible to keep the compat handling somewhere else, without the use of #ifdef CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. The use of iovec is within a write operation on a char device. It's not amenable to a compat_sys_ or a .compat_ioctl approach. I'm partial to #1 because the use of architecture-independent fields matches the rest of struct sg_io_v4. But if you don't want to have another iovec type in the kernel, could we do #2 but just return -EINVAL if the need for compat is detected? I.e. change dout_iovec_count to dout_iovec_length and do the math? If you are ok with removing the write/read interface and just have ioctl, we could can handle comapt stuff like others do. But I think that you (OSD people) really want to keep the write/read interface. Sorry, I think that there is no workaround to support iovec in bsg. I don't care about read/write in particular. But we do need some way to launch asynchronous SCSI commands, and currently read/write are the only way to do that in bsg. The reason is to keep multiple spindles busy at the same time. Won't multi-threading the ioctl calls achieve the same effect? Or do you trip over BKL there? There's no BKL on (new) ioctls anymore, at least. A thread per device would be feasible perhaps. But if you want any sort of pipelining out of the device, esp. in the remote iSCSI case, you need to have a good number of commands outstanding to each device. So a thread per command per device. Typical iSCSI queue depth of 128 times 16 devices for a small setup is a lot of threads. The pthread/pipe latency overhead is not insignificant for fast storage networks too. How about these new ioctls instead of read/write: SG_IO_SUBMIT - start a new blk_execute_rq_nowait() SG_IO_TEST - complete and return a previous req SG_IO_WAIT - wait for a req to finish, interruptibly Then old write users will instead do ioctl SUBMIT. Read users will do TEST for non-blocking fd, or WAIT for blocking. And SG_IO could be implemented as SUBMIT + WAIT. Then we can do compat_ioctl and convert up iovecs out-of-line before calling the normal functions. Let me know if you want a patch for this. Really, the thought of re-inventing yet another async I/O interface isn't very appealing. I'm fine with read/write, except Tomo is against handling iovecs because of the compat complexity with struct iovec being different on 32- vs 64-bit. There is a standard way to do compat ioctl that hides this handling in a different file (not bsg.c), which is the only reason I'm even considering these ioctls. I don't care about compat setups per se. Is there another async I/O mechanism? Userspace builds the CDBs, just needs some way to drop them in SCSI ML. BSG is almost perfect for this, but doesn't do iovec, leading to lots of memcpy. -- Pete -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Sat, 05 Jan 2008 14:01 +0900: > From: Deepak Colluru <[EMAIL PROTECTED]> > Subject: [PATCH] bsg : Add support for io vectors in bsg > Date: Fri, 4 Jan 2008 21:47:34 +0530 (IST) > > > From: Deepak Colluru <[EMAIL PROTECTED]> > > > > Add support for io vectors in bsg. > > > > Signed-off-by: Deepak Colluru <[EMAIL PROTECTED]> > > --- > > bsg.c | 52 +--- > > 1 file changed, 49 insertions(+), 3 deletions(-) > > Thanks, but I have to NACK this. > > You can find the discussion about bsg io vector support and a similar > patch in linux-scsi archive. I have no plan to support it since it > needs the compat hack. You may recall this is one of the patches I need to use bsg with OSD devices. OSDs overload the SCSI buffer model to put mulitple fields in dataout and datain. Some is user data, but some is more logically created by a library. Memcpying in userspace to wedge all the segments into a single buffer is painful, and is required both on outgoing and incoming data buffers. There are two approaches to add iovec to bsg. 1. Define a new sg_iovec_v4 that uses constant width types. Both 32- and 64-bit userspace would hand arrays of this to the kernel. struct sg_v4_iovec { __u64 iov_base; __u32 iov_len; __u32 __pad1; }; Old patch here: http://article.gmane.org/gmane.linux.scsi/30461/ 2. Do as Deepak has done, using the existing sg_iovec, but then also work around the compat issue. Old v3 sg_iovec is: typedef struct sg_iovec /* same structure as used by readv() Linux system */ { /* call. It defines one scatter-gather element. */ void __user *iov_base; /* Starting address */ size_t iov_len; /* Length in bytes */ } sg_iovec_t; Old patch here: http://article.gmane.org/gmane.linux.scsi/30460/ I took another look at the compat approach, to see if it is feasible to keep the compat handling somewhere else, without the use of #ifdef CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. The use of iovec is within a write operation on a char device. It's not amenable to a compat_sys_ or a .compat_ioctl approach. I'm partial to #1 because the use of architecture-independent fields matches the rest of struct sg_io_v4. But if you don't want to have another iovec type in the kernel, could we do #2 but just return -EINVAL if the need for compat is detected? I.e. change dout_iovec_count to dout_iovec_length and do the math? -- Pete -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] bsg : Add support for io vectors in bsg
[EMAIL PROTECTED] wrote on Sat, 05 Jan 2008 14:01 +0900: From: Deepak Colluru [EMAIL PROTECTED] Subject: [PATCH] bsg : Add support for io vectors in bsg Date: Fri, 4 Jan 2008 21:47:34 +0530 (IST) From: Deepak Colluru [EMAIL PROTECTED] Add support for io vectors in bsg. Signed-off-by: Deepak Colluru [EMAIL PROTECTED] --- bsg.c | 52 +--- 1 file changed, 49 insertions(+), 3 deletions(-) Thanks, but I have to NACK this. You can find the discussion about bsg io vector support and a similar patch in linux-scsi archive. I have no plan to support it since it needs the compat hack. You may recall this is one of the patches I need to use bsg with OSD devices. OSDs overload the SCSI buffer model to put mulitple fields in dataout and datain. Some is user data, but some is more logically created by a library. Memcpying in userspace to wedge all the segments into a single buffer is painful, and is required both on outgoing and incoming data buffers. There are two approaches to add iovec to bsg. 1. Define a new sg_iovec_v4 that uses constant width types. Both 32- and 64-bit userspace would hand arrays of this to the kernel. struct sg_v4_iovec { __u64 iov_base; __u32 iov_len; __u32 __pad1; }; Old patch here: http://article.gmane.org/gmane.linux.scsi/30461/ 2. Do as Deepak has done, using the existing sg_iovec, but then also work around the compat issue. Old v3 sg_iovec is: typedef struct sg_iovec /* same structure as used by readv() Linux system */ { /* call. It defines one scatter-gather element. */ void __user *iov_base; /* Starting address */ size_t iov_len; /* Length in bytes */ } sg_iovec_t; Old patch here: http://article.gmane.org/gmane.linux.scsi/30460/ I took another look at the compat approach, to see if it is feasible to keep the compat handling somewhere else, without the use of #ifdef CONFIG_COMPAT and size-comparison code inside bsg.c. I don't see how. The use of iovec is within a write operation on a char device. It's not amenable to a compat_sys_ or a .compat_ioctl approach. I'm partial to #1 because the use of architecture-independent fields matches the rest of struct sg_io_v4. But if you don't want to have another iovec type in the kernel, could we do #2 but just return -EINVAL if the need for compat is detected? I.e. change dout_iovec_count to dout_iovec_length and do the math? -- Pete -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: list corruption on ib_srp load in v2.6.24-rc5
[EMAIL PROTECTED] wrote on Fri, 21 Dec 2007 17:18 -0500: > On Fri, 2007-12-21 at 16:52 -0500, David Dillow wrote: > > I'm getting the following oops when doing the following commands: > > > > modprobe ib_srp > > > > rmmod ib_srp > > modprobe ib_srp > > > > > > I'm going to try and track down how the list is getting corrupted; it > > looks like attribute_container_list in > > drivers/base/attribute_container.c is the one getting corrupted. > > Ok, found the culprit, now to figure out the motive and fix it. > > ib_srp's srp_cleanup_module calls srp_release_transport(), which calls > transport_container_unregister() for the rport_attr_cont member of > struct srp_internal. > > That last unregister call is returning -EBUSY, but it gets ignored, and > the list node gets erased (or just reused) when the module's text/memory > is free'd. > > Now, to see if ib_srp should be waiting for everything to be destroyed > before calling srp_release_transport(), or if it is just not removing > some attributes properly. I don't see where srp_cleanup_module() is calling srp_remove_host(). That is the likely way that transport devices should be made to go away. Something on the order of srp_remove_work(). Or srp_remove_one() except with a call to srp_remove_host() may be necessary. In fact, maybe just adding that call will fix it, as ib_unregister_client should drive the remove function. Guesses, all this. -- Pete -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ofa-general] Re: list corruption on ib_srp load in v2.6.24-rc5
[EMAIL PROTECTED] wrote on Fri, 21 Dec 2007 17:18 -0500: On Fri, 2007-12-21 at 16:52 -0500, David Dillow wrote: I'm getting the following oops when doing the following commands: modprobe ib_srp add targets(s) to ib_srp using sysfs rmmod ib_srp modprobe ib_srp OOPS I'm going to try and track down how the list is getting corrupted; it looks like attribute_container_list in drivers/base/attribute_container.c is the one getting corrupted. Ok, found the culprit, now to figure out the motive and fix it. ib_srp's srp_cleanup_module calls srp_release_transport(), which calls transport_container_unregister() for the rport_attr_cont member of struct srp_internal. That last unregister call is returning -EBUSY, but it gets ignored, and the list node gets erased (or just reused) when the module's text/memory is free'd. Now, to see if ib_srp should be waiting for everything to be destroyed before calling srp_release_transport(), or if it is just not removing some attributes properly. I don't see where srp_cleanup_module() is calling srp_remove_host(). That is the likely way that transport devices should be made to go away. Something on the order of srp_remove_work(). Or srp_remove_one() except with a call to srp_remove_host() may be necessary. In fact, maybe just adding that call will fix it, as ib_unregister_client should drive the remove function. Guesses, all this. -- Pete -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: getrusage vs /proc/pid/stat?
[EMAIL PROTECTED] said: > I'd like to monitor CPU, memory, and I/O utilization in a > long-running multithreaded daemon under kernels 2.2, 2.4, > and possibly also Solaris (#ifdefs are ok). > > getrusage() looked promising, and might even work for CPU utilization. > Dunno if it returns info for all child threads yet, haven't tried it. > In Linux, though, getrusage() doesn't return any info about RAM. > > I know I can get the RSS and VSIZE under Linux by parsing /proc/pid/stat, > but was hoping for a faster interface (although I suppose a seek, > a read, and an ascii parse isn't *that* slow). Is /proc/pid/stat > the only way to go under Linux to monitor RSS? getrusage() isn't really the system call you want for this. There is a max RSS value, which linux could support but doesn't, but you seem to want to see the current RSS over time. Search the archive for various patches/complaints about getrusage. For vsize, most OSes offer time-integral averages of text, data, and stack sizes via getrusage(). Again, more of an aggregate than a current snapshot, and again, linux returns zero for these currently. -- Pete - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [docPATCH] mm.h documentation
[EMAIL PROTECTED] said: > typedef struct page { [..] > + unsigned long index;/* Our offset within mapping. */ [..] > + * A page may belong to an inode's memory mapping. In this case, > + * page->mapping is the pointer to the inode, and page->offset is the > + * file offset of the page (not necessarily a multiple of PAGE_SIZE). Minor nit. The field offset was renamed to index some time ago, but I can't figure out if the usage changed. Can you fix the comment and educate us? -- Pete - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [docPATCH] mm.h documentation
[EMAIL PROTECTED] said: typedef struct page { [..] + unsigned long index;/* Our offset within mapping. */ [..] + * A page may belong to an inode's memory mapping. In this case, + * page-mapping is the pointer to the inode, and page-offset is the + * file offset of the page (not necessarily a multiple of PAGE_SIZE). Minor nit. The field offset was renamed to index some time ago, but I can't figure out if the usage changed. Can you fix the comment and educate us? -- Pete - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: getrusage vs /proc/pid/stat?
[EMAIL PROTECTED] said: I'd like to monitor CPU, memory, and I/O utilization in a long-running multithreaded daemon under kernels 2.2, 2.4, and possibly also Solaris (#ifdefs are ok). getrusage() looked promising, and might even work for CPU utilization. Dunno if it returns info for all child threads yet, haven't tried it. In Linux, though, getrusage() doesn't return any info about RAM. I know I can get the RSS and VSIZE under Linux by parsing /proc/pid/stat, but was hoping for a faster interface (although I suppose a seek, a read, and an ascii parse isn't *that* slow). Is /proc/pid/stat the only way to go under Linux to monitor RSS? getrusage() isn't really the system call you want for this. There is a max RSS value, which linux could support but doesn't, but you seem to want to see the current RSS over time. Search the archive for various patches/complaints about getrusage. For vsize, most OSes offer time-integral averages of text, data, and stack sizes via getrusage(). Again, more of an aggregate than a current snapshot, and again, linux returns zero for these currently. -- Pete - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3com Driver and the 3XP Processor
[EMAIL PROTECTED] said: > [EMAIL PROTECTED] writes: > > So are there any intresting changes one can make to the acenic? > > Like I said, there is an entire firmware developer kit, so the only > limit is your imagination and coding skills :-) I wrote a new firmware from scratch to offload most of a message passing implementation (MPI, actually) into the Alteon NIC, including timeout and retransmit, message matching, header building, etc. It's almost ready for release. We're currently working on using both processors of the Tigon in parallel. There is other work which has modified the Alteon firmware to do, for example, segmentation and reassembly (Underwood et al.), and further work in progress on other topics. Programmable NICs are fun. I'd like to get one of those Tigon3 cards to see if any of the register layout on the Tigon2 is still there, hoping the interface is similar at least. -- Pete - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 3com Driver and the 3XP Processor
[EMAIL PROTECTED] said: [EMAIL PROTECTED] writes: So are there any intresting changes one can make to the acenic? Like I said, there is an entire firmware developer kit, so the only limit is your imagination and coding skills :-) I wrote a new firmware from scratch to offload most of a message passing implementation (MPI, actually) into the Alteon NIC, including timeout and retransmit, message matching, header building, etc. It's almost ready for release. We're currently working on using both processors of the Tigon in parallel. There is other work which has modified the Alteon firmware to do, for example, segmentation and reassembly (Underwood et al.), and further work in progress on other topics. Programmable NICs are fun. I'd like to get one of those Tigon3 cards to see if any of the register layout on the Tigon2 is still there, hoping the interface is similar at least. -- Pete - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: forcibly unmap pages in driver?
[EMAIL PROTECTED] said: > My driver uses a variable-size DMA buffer that it shares with user-space; I > provide an ioctl() to choose the buffer size and allocate the buffer. Say > the user program chooses a large buffer size, and mmap()s the entire buffer. > Later, the program calls the ioctl() again to set a smaller buffer size, or > closes the file descriptor. At this point I'd like to shrink the buffer or > free it completely. But I can't assume that the program will be nice and > munmap() the region for me - it might still have the large buffer mapped. > What should I do here? > > An easy solution would to allocate the largest possible buffer as my driver > is loaded, even if not all of it will be exposed to user-space. I don't > really like this choice because the buffer needs to be pinned in memory, and > the largest useful buffer size is very big (several tens of MB). Maybe I > should disallow more than one buffer allocation per open() of the device... > But the memory mapping will stay around even after close(), correct? I'd > hate to have to keep the buffer around until my driver module is unloaded. I see. Your explanation makes sense, and close won't affect the mmap (unless you explicitly trigger it in the driver, which you should). Other drivers deal with this; you may want to go grepping for do_munmap and see how they handle it. [ > [EMAIL PROTECTED] said: ] > > However, do_munmap() will call zap_page_range() for you and take care of > > cache and TLB flushing if you're going to do this in the kernel. > > I'm not sure if I could use do_munmap() -- how will I know if the user > program has called munmap() already, and then mmap()ed something else in the > same place? Then I'd be killing the wrong mapping... Look at drivers/char/drm, for example. At mmap time they allocate a vm_ops to the address space. With that you catch changes to the vma structure initiated by a user mmap, munmap, etc. You could also dynamically map the pages in using the nopage method (optional). The less elegant way of doing this is to put in your userspace API some conditions which say: if you do the following: open(fd); ioctl(fd, give_me_the_buf); munmap(some_or_all_of_the_buf); buf2 = mmap(...); close(fd); /* or ioctl(fd, shrink_the_buf); */ buf2[27] = 3; you will be killed with a sigbus. I.e. don't manually munmap the region. -- Pete - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: forcibly unmap pages in driver?
[EMAIL PROTECTED] said: > I am writing a device driver that, like many others, exposes a shared memory > region to user-space via mmap(). The region is allocated with vmalloc(), the > pages are marked reserved, and the user-space mapping is implemented with > remap_page_range(). > > In my driver, I may have to free the underlying vmalloc() region while the > user-space program is still running. I need to remove the user-space > mapping -- otherwise the user process would still have access to the > now-freed pages. I need an inverse of remap_page_range(). > > Is zap_page_range() the function I am looking for? Unfortunately it's not > exported to modules =(. As a quick fix, I was thinking I could just remap > all of the user pages to point to a zeroed page or something... > > Another question- in the mm.c sources, I see that many of the memory-mapping > functions are surrounded by calls to flush_cache_range() and > flush_tlb_range(). But I don't see these calls in many drivers. Is it > necessary to make them when my driver maps or unmaps the shared memory > region? That seems a bit perverse. How will the poor userspace program know not to access the pages you have yanked away from it? If you plan to kill it, better to do that directly. If you plan to signal it that the mapping is gone, it can just call munmap() itself. However, do_munmap() will call zap_page_range() for you and take care of cache and TLB flushing if you're going to do this in the kernel. Your driver mmap function is called by do_mmap_pgoff() which takes care of those issues, and there is no (*munmap) in file_operations--- perhaps you are the first driver writer to want to unmap in the kernel. -- Pete - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: forcibly unmap pages in driver?
[EMAIL PROTECTED] said: I am writing a device driver that, like many others, exposes a shared memory region to user-space via mmap(). The region is allocated with vmalloc(), the pages are marked reserved, and the user-space mapping is implemented with remap_page_range(). In my driver, I may have to free the underlying vmalloc() region while the user-space program is still running. I need to remove the user-space mapping -- otherwise the user process would still have access to the now-freed pages. I need an inverse of remap_page_range(). Is zap_page_range() the function I am looking for? Unfortunately it's not exported to modules =(. As a quick fix, I was thinking I could just remap all of the user pages to point to a zeroed page or something... Another question- in the mm.c sources, I see that many of the memory-mapping functions are surrounded by calls to flush_cache_range() and flush_tlb_range(). But I don't see these calls in many drivers. Is it necessary to make them when my driver maps or unmaps the shared memory region? That seems a bit perverse. How will the poor userspace program know not to access the pages you have yanked away from it? If you plan to kill it, better to do that directly. If you plan to signal it that the mapping is gone, it can just call munmap() itself. However, do_munmap() will call zap_page_range() for you and take care of cache and TLB flushing if you're going to do this in the kernel. Your driver mmap function is called by do_mmap_pgoff() which takes care of those issues, and there is no (*munmap) in file_operations--- perhaps you are the first driver writer to want to unmap in the kernel. -- Pete - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: forcibly unmap pages in driver?
[EMAIL PROTECTED] said: My driver uses a variable-size DMA buffer that it shares with user-space; I provide an ioctl() to choose the buffer size and allocate the buffer. Say the user program chooses a large buffer size, and mmap()s the entire buffer. Later, the program calls the ioctl() again to set a smaller buffer size, or closes the file descriptor. At this point I'd like to shrink the buffer or free it completely. But I can't assume that the program will be nice and munmap() the region for me - it might still have the large buffer mapped. What should I do here? An easy solution would to allocate the largest possible buffer as my driver is loaded, even if not all of it will be exposed to user-space. I don't really like this choice because the buffer needs to be pinned in memory, and the largest useful buffer size is very big (several tens of MB). Maybe I should disallow more than one buffer allocation per open() of the device... But the memory mapping will stay around even after close(), correct? I'd hate to have to keep the buffer around until my driver module is unloaded. I see. Your explanation makes sense, and close won't affect the mmap (unless you explicitly trigger it in the driver, which you should). Other drivers deal with this; you may want to go grepping for do_munmap and see how they handle it. [ [EMAIL PROTECTED] said: ] However, do_munmap() will call zap_page_range() for you and take care of cache and TLB flushing if you're going to do this in the kernel. I'm not sure if I could use do_munmap() -- how will I know if the user program has called munmap() already, and then mmap()ed something else in the same place? Then I'd be killing the wrong mapping... Look at drivers/char/drm, for example. At mmap time they allocate a vm_ops to the address space. With that you catch changes to the vma structure initiated by a user mmap, munmap, etc. You could also dynamically map the pages in using the nopage method (optional). The less elegant way of doing this is to put in your userspace API some conditions which say: if you do the following: open(fd); ioctl(fd, give_me_the_buf); munmap(some_or_all_of_the_buf); buf2 = mmap(...); close(fd); /* or ioctl(fd, shrink_the_buf); */ buf2[27] = 3; you will be killed with a sigbus. I.e. don't manually munmap the region. -- Pete - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Makefile patch for cscope and saner Ctags
[EMAIL PROTECTED] said: > The following patch generates saner Ctags and will build cscope > output. It's against 2.4.5 > > --- Makefile.old Mon May 28 22:44:01 2001 > +++ Makefile Wed May 30 17:50:01 2001 > @@ -334,11 +334,32 @@ > > # Exuberant ctags works better with -I > tags: dummy > - CTAGSF=`ctags --version | grep -i exuberant >/dev/null && echo "-I >__initdata,__exitdata,EXPORT_SYMBOL,EXPORT_SYMBOL_NOVERS"`; \ > + CTAGSF=`ctags --version | grep -i exuberant >/dev/null && echo "--sort=no -I >__initdata,__exitdata,EXPORT_SYMBOL,EXPORT_SYMBOL_NOVERS"`; \ > ctags $$CTAGSF `find include/asm-$(ARCH) -name '*.h'` && \ > - find include -type d \( -name "asm-*" -o -name config \) -prune -o -name '*.h' >-print | xargs ctags $$CTAGSF -a && \ > + find include -type f -name '*.h' -mindepth 2 -maxdepth 2 \ > + | grep -v include/asm- | grep -v include/config \ > + | xargs -r ctags $$CTAGSF -a && \ > + find include -type f -name '*.h' -mindepth 3 -maxdepth 3 \ > + | grep -v include/asm- | grep -v include/config \ > + | xargs -r ctags $$CTAGSF -a && \ > + find include -type f -name '*.h' -mindepth 4 -maxdepth 4 \ > + | grep -v include/asm- | grep -v include/config \ > + | xargs -r ctags $$CTAGSF -a && \ > + find include -type f -name '*.h' -mindepth 5 -maxdepth 5 \ > + | grep -v include/asm- | grep -v include/config \ > + | xargs -r ctags $$CTAGSF -a && \ > find $(SUBDIRS) init -name '*.c' | xargs ctags $$CTAGSF -a > + mv tags tags.unsorted > + LC_ALL=C sort -k 1,1 -s tags.unsorted > tags > + rm tags.unsorted > > +cscope: dummy > + find include/asm-$(ARCH) -name '*.h' >cscope.files > + find include $(SUBDIRS) init -type f -name '*.[ch]' \ > + | grep -v include/asm- | grep -v include/config >> cscope.files > + cscope -b -I include > + > + > ifdef CONFIG_MODULES > ifdef CONFIG_MODVERSIONS > MODFLAGS += -DMODVERSIONS -include $(HPATH)/linux/modversions.h You seem not to have read my response to your earlier mail proprosing such a thing (for tags only, not cscope): http://boudicca.tux.org/hypermail/linux-kernel/2001week21/1869.html How does the patch above fix anything? You're sorting so that include/linux/*.h comes before include/linux/{mtd,lockd,raid,...}/*.h, but I don't see how that can be an improvement, or how it addresses your original complaint "ctags doesn't honour any CPP #if'ing". -- Pete - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Makefile patch for cscope and saner Ctags
[EMAIL PROTECTED] said: The following patch generates saner Ctags and will build cscope output. It's against 2.4.5 --- Makefile.old Mon May 28 22:44:01 2001 +++ Makefile Wed May 30 17:50:01 2001 @@ -334,11 +334,32 @@ # Exuberant ctags works better with -I tags: dummy - CTAGSF=`ctags --version | grep -i exuberant /dev/null echo -I __initdata,__exitdata,EXPORT_SYMBOL,EXPORT_SYMBOL_NOVERS`; \ + CTAGSF=`ctags --version | grep -i exuberant /dev/null echo --sort=no -I __initdata,__exitdata,EXPORT_SYMBOL,EXPORT_SYMBOL_NOVERS`; \ ctags $$CTAGSF `find include/asm-$(ARCH) -name '*.h'` \ - find include -type d \( -name asm-* -o -name config \) -prune -o -name '*.h' -print | xargs ctags $$CTAGSF -a \ + find include -type f -name '*.h' -mindepth 2 -maxdepth 2 \ + | grep -v include/asm- | grep -v include/config \ + | xargs -r ctags $$CTAGSF -a \ + find include -type f -name '*.h' -mindepth 3 -maxdepth 3 \ + | grep -v include/asm- | grep -v include/config \ + | xargs -r ctags $$CTAGSF -a \ + find include -type f -name '*.h' -mindepth 4 -maxdepth 4 \ + | grep -v include/asm- | grep -v include/config \ + | xargs -r ctags $$CTAGSF -a \ + find include -type f -name '*.h' -mindepth 5 -maxdepth 5 \ + | grep -v include/asm- | grep -v include/config \ + | xargs -r ctags $$CTAGSF -a \ find $(SUBDIRS) init -name '*.c' | xargs ctags $$CTAGSF -a + mv tags tags.unsorted + LC_ALL=C sort -k 1,1 -s tags.unsorted tags + rm tags.unsorted +cscope: dummy + find include/asm-$(ARCH) -name '*.h' cscope.files + find include $(SUBDIRS) init -type f -name '*.[ch]' \ + | grep -v include/asm- | grep -v include/config cscope.files + cscope -b -I include + + ifdef CONFIG_MODULES ifdef CONFIG_MODVERSIONS MODFLAGS += -DMODVERSIONS -include $(HPATH)/linux/modversions.h You seem not to have read my response to your earlier mail proprosing such a thing (for tags only, not cscope): http://boudicca.tux.org/hypermail/linux-kernel/2001week21/1869.html How does the patch above fix anything? You're sorting so that include/linux/*.h comes before include/linux/{mtd,lockd,raid,...}/*.h, but I don't see how that can be an improvement, or how it addresses your original complaint ctags doesn't honour any CPP #if'ing. -- Pete - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: pte_page
[EMAIL PROTECTED] said: > On Wed, 30 May 2001 [EMAIL PROTECTED] wrote: > > I use the 'pgt_offset', 'pmd_offset', 'pte_offset' and 'pte_page' > > inside a module to get the physical address of a user space virtual > > address. The physical address returned by 'pte_page' is not page > > aligned whereas the virtual address was page aligned. Can somebody > > tell me the reason? > > __pa(page_address(pte_page(pte))) is the address you want. [or > pte_val(*pte) & (PAGE_SIZE-1) on x86 but this is platform-dependent.] Does this work on x86 non-kmapped highmem user pages too? (i.e. is page->virtual valid for every potential user page.) -- Pete > > Also, can i use these functions to get the physical address of a > > kernel virtual address using init_mm? > > nope. Eg. on x86 these functions only walk normal 4K page pagetables, they > do not walk 4MB pages correctly. (which are set up on pentiums and better > CPUs, unless mem=nopentium is specified.) > > a kernel virtual address can be decoded by simply doing __pa(kaddr). If > the page is a highmem page [and you have the struct page pointer] then you > can do [(page-mem_map) << PAGE_SHIFT] to get the physical address, but > only on systems where mem_map[] starts at physical address 0. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: pte_page
[EMAIL PROTECTED] said: On Wed, 30 May 2001 [EMAIL PROTECTED] wrote: I use the 'pgt_offset', 'pmd_offset', 'pte_offset' and 'pte_page' inside a module to get the physical address of a user space virtual address. The physical address returned by 'pte_page' is not page aligned whereas the virtual address was page aligned. Can somebody tell me the reason? __pa(page_address(pte_page(pte))) is the address you want. [or pte_val(*pte) (PAGE_SIZE-1) on x86 but this is platform-dependent.] Does this work on x86 non-kmapped highmem user pages too? (i.e. is page-virtual valid for every potential user page.) -- Pete Also, can i use these functions to get the physical address of a kernel virtual address using init_mm? nope. Eg. on x86 these functions only walk normal 4K page pagetables, they do not walk 4MB pages correctly. (which are set up on pentiums and better CPUs, unless mem=nopentium is specified.) a kernel virtual address can be decoded by simply doing __pa(kaddr). If the page is a highmem page [and you have the struct page pointer] then you can do [(page-mem_map) PAGE_SHIFT] to get the physical address, but only on systems where mem_map[] starts at physical address 0. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] minor lockd debugging typo
Rather than print proc twice, you probably intend to show the prog. (At least, I had wanted to see the prog in the debugging output. :) ) -- Pete diff -X dontdiff -ruN linux-2.4.5-kdb/fs/lockd/mon.c linux-2.4.5/fs/lockd/mon.c --- linux-2.4.5-kdb/fs/lockd/mon.c Mon Oct 16 15:58:51 2000 +++ linux-2.4.5/fs/lockd/mon.c Tue May 29 15:52:17 2001 @@ -146,7 +146,7 @@ u32 addr = ntohl(argp->addr); dprintk("nsm: xdr_encode_mon(%08x, %d, %d, %d)\n", - htonl(argp->addr), htonl(argp->proc), + htonl(argp->addr), htonl(argp->prog), htonl(argp->vers), htonl(argp->proc)); /* - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch] minor lockd debugging typo
Rather than print proc twice, you probably intend to show the prog. (At least, I had wanted to see the prog in the debugging output. :) ) -- Pete diff -X dontdiff -ruN linux-2.4.5-kdb/fs/lockd/mon.c linux-2.4.5/fs/lockd/mon.c --- linux-2.4.5-kdb/fs/lockd/mon.c Mon Oct 16 15:58:51 2000 +++ linux-2.4.5/fs/lockd/mon.c Tue May 29 15:52:17 2001 @@ -146,7 +146,7 @@ u32 addr = ntohl(argp-addr); dprintk(nsm: xdr_encode_mon(%08x, %d, %d, %d)\n, - htonl(argp-addr), htonl(argp-proc), + htonl(argp-addr), htonl(argp-prog), htonl(argp-vers), htonl(argp-proc)); /* - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dget()
[EMAIL PROTECTED] said: > # cd /usr/src/linux > # find -name '*.[ch]' | ctags -L- & > > On 15 May 2001, Xavier Bestel wrote: > > # cd /usr/src/linux > > # make tags > > No, I never use that one because it skips very useful entries like the > ones from EXPORT_SYMBOL etc. Also, it only shows the current architecture. > So, the tags target in the Makefile would only become useful when it is > stripped of extra (unnecessary, imho) logic and turned into a plain one I > suggested above. You can use the -I feature to ignore some keywords, but not the functions they modify. Without ignoring __initdata, for instance, you get a couple hundred tags for it, and none for the actual variable, e.g. cpu_vendor_names. My current ignore list for the kernel is at http://www.osc.edu/~pw/ctags-ignore . If you really want to see the EXPORT_SYMBOL(tag) lines, you'll want to remove that one from the list (and be sure to do the sorting...), Next, the wonderful editor Vim doesn't read my mind quite well enough. When I want to see the tag for "page", for example, I really want to look at the definition of the struct in include/linux/mm.h, not at any of the 16 other places which declare a variable struct page *page;. So I run the tags file through a perl script which percolates those interesting items to the top when there are multiple entries for the same identifier. It's here: http://www.osc.edu/~pw/sort-tags . Finally, I modify the Makefile to generate tags with these modifications. Here's the snippet: tags: dummy ( find include/asm-$(ARCH) -name "*.h" ;\ find include -type d \( -name "asm-*" -o -name config \) -prune \ -o -name "*.h" -print ; find $(SUBDIRS) init -name '*.[chS]' ) |\ ctags -I../ctags-ignore -L - -f - | sort-tags > tags Note that this only generates tags for your current architecture, and that your ctags must be Exuberant and version >= 5.0. -- Pete - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: dget()
[EMAIL PROTECTED] said: # cd /usr/src/linux # find -name '*.[ch]' | ctags -L- On 15 May 2001, Xavier Bestel wrote: # cd /usr/src/linux # make tags No, I never use that one because it skips very useful entries like the ones from EXPORT_SYMBOL etc. Also, it only shows the current architecture. So, the tags target in the Makefile would only become useful when it is stripped of extra (unnecessary, imho) logic and turned into a plain one I suggested above. You can use the -I feature to ignore some keywords, but not the functions they modify. Without ignoring __initdata, for instance, you get a couple hundred tags for it, and none for the actual variable, e.g. cpu_vendor_names. My current ignore list for the kernel is at http://www.osc.edu/~pw/ctags-ignore . If you really want to see the EXPORT_SYMBOL(tag) lines, you'll want to remove that one from the list (and be sure to do the sorting...), Next, the wonderful editor Vim doesn't read my mind quite well enough. When I want to see the tag for page, for example, I really want to look at the definition of the struct in include/linux/mm.h, not at any of the 16 other places which declare a variable struct page *page;. So I run the tags file through a perl script which percolates those interesting items to the top when there are multiple entries for the same identifier. It's here: http://www.osc.edu/~pw/sort-tags . Finally, I modify the Makefile to generate tags with these modifications. Here's the snippet: tags: dummy ( find include/asm-$(ARCH) -name *.h ;\ find include -type d \( -name asm-* -o -name config \) -prune \ -o -name *.h -print ; find $(SUBDIRS) init -name '*.[chS]' ) |\ ctags -I../ctags-ignore -L - -f - | sort-tags tags Note that this only generates tags for your current architecture, and that your ctags must be Exuberant and version = 5.0. -- Pete - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Direct Sockets Support??
[EMAIL PROTECTED] said: > > But in the case of an application which fits in main memory, and > > has been running for a while (so all pages are present and > > dirty), all you'd really have to do is verify the page tables are > > in the proper state and skip the TLB flush, right? > > We really cannot assume this. There are two cases > a. when a user app wants to receive some data, it allocates > memory(using malloc) and waits for the hw to do zero-copy read. The kernel > does not allocate physical page frames for the entire memory region > allocated. We need to lock the memory (and locking is expensive due to > costly TLB flushes) to do this > > b. when a user app wants to send data, he fills the buffer > and waits for the hw to transmit data, but under heavy physical memory > pressure, the swapper might swap the pages we want to transmit. So we need > to lock the memory to be 100% sure. You're right, of course. But I suspect that the fast path of re-locking memory which is happily in core will go much faster by removing the multi-processor TLB purge. And it can't hurt, unless I'm missing something. -- Pete --- linux-2.4.4-stock/mm/mlock.cTue May 8 17:26:34 2001 +++ linux/mm/mlock.cTue May 8 17:24:13 2001 @@ -114,6 +114,10 @@ return 0; } +/* implemented in mm/memory.c */ +extern int mlock_make_pages_present(struct vm_area_struct *vma, + unsigned long addr, unsigned long end); + static int mlock_fixup(struct vm_area_struct * vma, unsigned long start, unsigned long end, unsigned int newflags) { @@ -138,7 +142,7 @@ pages = (end - start) >> PAGE_SHIFT; if (newflags & VM_LOCKED) { pages = -pages; - make_pages_present(start, end); + mlock_make_pages_present(vma, start, end); } vma->vm_mm->locked_vm -= pages; } --- linux-2.4.4-stock/mm/memory.c Tue May 8 17:25:36 2001 +++ linux/mm/memory.c Tue May 8 17:24:40 2001 @@ -1438,3 +1438,80 @@ } while (addr < end); return 0; } + +/* + * Specialized version of make_pages_present which does not require + * a multi-processor TLB purge for every page if nothing about the PTE + * was modified. + */ +int mlock_make_pages_present(struct vm_area_struct *vma, + unsigned long addr, unsigned long end) +{ + int ret, write; + struct mm_struct *mm = current->mm; + + write = (vma->vm_flags & VM_WRITE) != 0; + + /* +* We need the page table lock to synchronize with kswapd +* and the SMP-safe atomic PTE updates. +*/ + spin_lock(>page_table_lock); + + ret = 0; + for (ret=0; !ret && addr < end; addr += PAGE_SIZE) { + pgd_t *pgd; + pmd_t *pmd; + pte_t *pte, entry; + int modified; + + current->state = TASK_RUNNING; + pgd = pgd_offset(mm, addr); + pmd = pmd_alloc(mm, pgd, addr); + if (!pmd) { + ret = -1; + break; + } + pte = pte_alloc(mm, pmd, addr); + if (!pte) { + ret = -1; + break; + } + entry = *pte; + if (!pte_present(entry)) { + /* +* If it truly wasn't present, we know that kswapd +* and the PTE updates will not touch it later. So +* drop the lock. +*/ + if (pte_none(entry)) { + ret = do_no_page(mm, vma, addr, write, pte); + continue; + } + ret = do_swap_page(mm, vma, addr, pte, + pte_to_swp_entry(entry), write); + continue; + } + + modified = 0; + if (write) { + if (!pte_write(entry)) { + ret = do_wp_page(mm, vma, addr, pte, entry); + continue; + } + if (!pte_dirty(entry)) { + entry = pte_mkdirty(entry); + modified = 1; + } + } + if (!pte_young(entry)) { + entry = pte_mkyoung(entry); + modified = 1; + } + if (modified) + establish_pte(vma, addr, pte, entry); + } + + spin_unlock(>page_table_lock); + return ret; +} - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at
Re: [RFC] Direct Sockets Support??
[EMAIL PROTECTED] said: > > A couple of concerns I have: > > * How to pin or pagelock the application buffer without > > making a kernel transition. > > You need to pin them in advance. And pinning pages is _expensive_ so you dont > want to keep pinning/unpinning pages I can't convince myself why this has to be so expensive. The current implementation does this for mlock: 1. Split vma if only a subset of the pages are being locked. 2. Mark bit in vma. 3. Make sure the pages are in core. That third step has the potential of being the most expensive, as changing the page tables requires invalidating the TLBs on all processors. Currently make_pages_present() does the work for 3. But in the case of an application which fits in main memory, and has been running for a while (so all pages are present and dirty), all you'd really have to do is verify the page tables are in the proper state and skip the TLB flush, right? Then 3 turns into a single spin_lock pair for the page_table_lock, and walking down the page table. The VMA splitting can be nasty, as it might require a couple of slab allocations, and doing an AVL insertion. (More nastiness in the case of shared memory or file mapping, too.) But nothing like playing with TLBs. Any reason why make_pages_present() is not the really oversized hammer it seems to be? -- Pete - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Direct Sockets Support??
[EMAIL PROTECTED] said: A couple of concerns I have: * How to pin or pagelock the application buffer without making a kernel transition. You need to pin them in advance. And pinning pages is _expensive_ so you dont want to keep pinning/unpinning pages I can't convince myself why this has to be so expensive. The current implementation does this for mlock: 1. Split vma if only a subset of the pages are being locked. 2. Mark bit in vma. 3. Make sure the pages are in core. That third step has the potential of being the most expensive, as changing the page tables requires invalidating the TLBs on all processors. Currently make_pages_present() does the work for 3. But in the case of an application which fits in main memory, and has been running for a while (so all pages are present and dirty), all you'd really have to do is verify the page tables are in the proper state and skip the TLB flush, right? Then 3 turns into a single spin_lock pair for the page_table_lock, and walking down the page table. The VMA splitting can be nasty, as it might require a couple of slab allocations, and doing an AVL insertion. (More nastiness in the case of shared memory or file mapping, too.) But nothing like playing with TLBs. Any reason why make_pages_present() is not the really oversized hammer it seems to be? -- Pete - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Direct Sockets Support??
[EMAIL PROTECTED] said: But in the case of an application which fits in main memory, and has been running for a while (so all pages are present and dirty), all you'd really have to do is verify the page tables are in the proper state and skip the TLB flush, right? We really cannot assume this. There are two cases a. when a user app wants to receive some data, it allocates memory(using malloc) and waits for the hw to do zero-copy read. The kernel does not allocate physical page frames for the entire memory region allocated. We need to lock the memory (and locking is expensive due to costly TLB flushes) to do this b. when a user app wants to send data, he fills the buffer and waits for the hw to transmit data, but under heavy physical memory pressure, the swapper might swap the pages we want to transmit. So we need to lock the memory to be 100% sure. You're right, of course. But I suspect that the fast path of re-locking memory which is happily in core will go much faster by removing the multi-processor TLB purge. And it can't hurt, unless I'm missing something. -- Pete --- linux-2.4.4-stock/mm/mlock.cTue May 8 17:26:34 2001 +++ linux/mm/mlock.cTue May 8 17:24:13 2001 @@ -114,6 +114,10 @@ return 0; } +/* implemented in mm/memory.c */ +extern int mlock_make_pages_present(struct vm_area_struct *vma, + unsigned long addr, unsigned long end); + static int mlock_fixup(struct vm_area_struct * vma, unsigned long start, unsigned long end, unsigned int newflags) { @@ -138,7 +142,7 @@ pages = (end - start) PAGE_SHIFT; if (newflags VM_LOCKED) { pages = -pages; - make_pages_present(start, end); + mlock_make_pages_present(vma, start, end); } vma-vm_mm-locked_vm -= pages; } --- linux-2.4.4-stock/mm/memory.c Tue May 8 17:25:36 2001 +++ linux/mm/memory.c Tue May 8 17:24:40 2001 @@ -1438,3 +1438,80 @@ } while (addr end); return 0; } + +/* + * Specialized version of make_pages_present which does not require + * a multi-processor TLB purge for every page if nothing about the PTE + * was modified. + */ +int mlock_make_pages_present(struct vm_area_struct *vma, + unsigned long addr, unsigned long end) +{ + int ret, write; + struct mm_struct *mm = current-mm; + + write = (vma-vm_flags VM_WRITE) != 0; + + /* +* We need the page table lock to synchronize with kswapd +* and the SMP-safe atomic PTE updates. +*/ + spin_lock(mm-page_table_lock); + + ret = 0; + for (ret=0; !ret addr end; addr += PAGE_SIZE) { + pgd_t *pgd; + pmd_t *pmd; + pte_t *pte, entry; + int modified; + + current-state = TASK_RUNNING; + pgd = pgd_offset(mm, addr); + pmd = pmd_alloc(mm, pgd, addr); + if (!pmd) { + ret = -1; + break; + } + pte = pte_alloc(mm, pmd, addr); + if (!pte) { + ret = -1; + break; + } + entry = *pte; + if (!pte_present(entry)) { + /* +* If it truly wasn't present, we know that kswapd +* and the PTE updates will not touch it later. So +* drop the lock. +*/ + if (pte_none(entry)) { + ret = do_no_page(mm, vma, addr, write, pte); + continue; + } + ret = do_swap_page(mm, vma, addr, pte, + pte_to_swp_entry(entry), write); + continue; + } + + modified = 0; + if (write) { + if (!pte_write(entry)) { + ret = do_wp_page(mm, vma, addr, pte, entry); + continue; + } + if (!pte_dirty(entry)) { + entry = pte_mkdirty(entry); + modified = 1; + } + } + if (!pte_young(entry)) { + entry = pte_mkyoung(entry); + modified = 1; + } + if (modified) + establish_pte(vma, addr, pte, entry); + } + + spin_unlock(mm-page_table_lock); + return ret; +} - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ
Re: Question on mmap(2) with kernel alocated memory
[EMAIL PROTECTED] said: > I am trying to mmap() into user space a kernel buffer and am having > problems. > I have a simple test example, can someone please tell me what I have got > wrong ? > > In a driver I do: > uint*kva; > > kva = (uint*)kmalloc(4096, GFP_KERNEL); > *kva = 0x11223344; > printk("Address: %p %lx %x\n", kva, virt_to_phys(kva), *kva); > > Now in some simple user program I do: > > #include > #include > #include > #include > #include > > int main(int argc, char** argv){ > int fm; > char* p; > uint* pi; > uint v; > uint add = 0x74b000; > > if((fm = open("/dev/mem", O_RDWR)) < 0) > return 1; > > p = mmap(0, 128 * 1024 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fm, > 0); > printf("Mapped: %p\n", p); > > lseek(fm, add, SEEK_SET); > read(fm, , sizeof(v)); > printf("V: %x\n", v); > > pi = (uint*)(p + add); > printf("Vmmap: %p %x\n", pi, *pi); > > close(fm); > return 0; > } > > The value of add is hardcoded to the value printed for the physical > address in the drivers prink routine. > The lseek/read from the /dev/mem device yields the value 0x11223344. > However the mmap method also on /dev/mem yields the value 0. > > Whats wrong with my mmap() or kalloc() ? Executive summary: mmap of /dev/mem gives different values than lseek/read of /dev/mem Using lseek/read gives back bits of physical memory, but dereferencing a pointer into the mmaped area mostly gives zeroes. The file-system specific mmap routine, mmap_mem, calls remap_page_range to do the work of remapping the physical pages into user space. But eventually in remap_pte_range there's a check if ((!VALID_PAGE(page)) || PageReserved(page)) set_pte(pte, mk_pte_phys(phys_addr, prot)); And the effect is that only the reserved pages and those outside of the physical memory space get mapped. This isn't intuitive for /dev/mem, but is it the intended behavior? You could replicate those 80 odd lines of remap_page_range and helpers to get a version without the PageReserved test. Yuk. Quick hack: SetPageReserved(virt_to_page(kva)); but you may want to undo that (Clear...) before kfree-ing the page, as I'm not sure what will be confused by that. Longer term solution: write your own mmap routine in the driver, have the user code open /dev/my-driver, and mmap() on that instead of going through /dev/mem with a hardcoded address. -- Pete - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Question on mmap(2) with kernel alocated memory
[EMAIL PROTECTED] said: I am trying to mmap() into user space a kernel buffer and am having problems. I have a simple test example, can someone please tell me what I have got wrong ? In a driver I do: uint*kva; kva = (uint*)kmalloc(4096, GFP_KERNEL); *kva = 0x11223344; printk(Address: %p %lx %x\n, kva, virt_to_phys(kva), *kva); Now in some simple user program I do: #include stdio.h #include string.h #include stdlib.h #include sys/mman.h #include fcntl.h int main(int argc, char** argv){ int fm; char* p; uint* pi; uint v; uint add = 0x74b000; if((fm = open(/dev/mem, O_RDWR)) 0) return 1; p = mmap(0, 128 * 1024 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fm, 0); printf(Mapped: %p\n, p); lseek(fm, add, SEEK_SET); read(fm, v, sizeof(v)); printf(V: %x\n, v); pi = (uint*)(p + add); printf(Vmmap: %p %x\n, pi, *pi); close(fm); return 0; } The value of add is hardcoded to the value printed for the physical address in the drivers prink routine. The lseek/read from the /dev/mem device yields the value 0x11223344. However the mmap method also on /dev/mem yields the value 0. Whats wrong with my mmap() or kalloc() ? Executive summary: mmap of /dev/mem gives different values than lseek/read of /dev/mem Using lseek/read gives back bits of physical memory, but dereferencing a pointer into the mmaped area mostly gives zeroes. The file-system specific mmap routine, mmap_mem, calls remap_page_range to do the work of remapping the physical pages into user space. But eventually in remap_pte_range there's a check if ((!VALID_PAGE(page)) || PageReserved(page)) set_pte(pte, mk_pte_phys(phys_addr, prot)); And the effect is that only the reserved pages and those outside of the physical memory space get mapped. This isn't intuitive for /dev/mem, but is it the intended behavior? You could replicate those 80 odd lines of remap_page_range and helpers to get a version without the PageReserved test. Yuk. Quick hack: SetPageReserved(virt_to_page(kva)); but you may want to undo that (Clear...) before kfree-ing the page, as I'm not sure what will be confused by that. Longer term solution: write your own mmap routine in the driver, have the user code open /dev/my-driver, and mmap() on that instead of going through /dev/mem with a hardcoded address. -- Pete - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Is jumbo ethernet MTU possible with Hamachi Gigabit ethernet driver?
The hardware also must support jumbo MTUs. Hamachi limit is 1518 or 1522 (vlan) bytes, and the driver can't fix that. -- Pete [EMAIL PROTECTED] said: > I'm trying to get some Gigabit ethernet cards that use the Packet > Engines Hamachi GNIC-II chip to use a large mtu to attempt to get a > throughput of close to the 1Gb rating of the card. This is on a Compact > PCI Alpha system. I'm trying to use an MTU in the 8000 to 9000 range > and so far have not been able to get these MTUs to work. > > I have changed the PKT_BUF_SZ and MAX_FRAME_SIZE constants in hamachi.c > and ETH_DATA_LEN and ETH_FRAME_LEN in if_ether.h. I can use ifconfig to > change the MTU above 1500 on one side of a connection but as soon as I > raise the MTU on both sides to greater than 1500 the link dies. I can > change the MTU with ifconfig back to 1500 and the link will resume > operation. We are currently somewhat married to the 2.2.14 kernel. > > I read that some ethernet drivers will support jumbo MTUs. There > appears to be something in the hamachi driver or the kernel that I've > missed. Perhaps this only works with a later version kernel or the > hamachi driver needs more changes? Any help would be appreciated. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Is jumbo ethernet MTU possible with Hamachi Gigabit ethernet driver?
The hardware also must support jumbo MTUs. Hamachi limit is 1518 or 1522 (vlan) bytes, and the driver can't fix that. -- Pete [EMAIL PROTECTED] said: I'm trying to get some Gigabit ethernet cards that use the Packet Engines Hamachi GNIC-II chip to use a large mtu to attempt to get a throughput of close to the 1Gb rating of the card. This is on a Compact PCI Alpha system. I'm trying to use an MTU in the 8000 to 9000 range and so far have not been able to get these MTUs to work. I have changed the PKT_BUF_SZ and MAX_FRAME_SIZE constants in hamachi.c and ETH_DATA_LEN and ETH_FRAME_LEN in if_ether.h. I can use ifconfig to change the MTU above 1500 on one side of a connection but as soon as I raise the MTU on both sides to greater than 1500 the link dies. I can change the MTU with ifconfig back to 1500 and the link will resume operation. We are currently somewhat married to the 2.2.14 kernel. I read that some ethernet drivers will support jumbo MTUs. There appears to be something in the hamachi driver or the kernel that I've missed. Perhaps this only works with a later version kernel or the hamachi driver needs more changes? Any help would be appreciated. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
ns558 pci/isa change
My joystick stopped working at 2.4.0-test10 due to a patch to drivers/char/joystick/ns558.c that moves pci probing ahead of isa probing. The problem is that pci_module_init can return -ENODEV into i, which is then used to index the ISA portlist. ISA probing assumes that i is initialized to zero. Trivial fix attached, with plenty of context. -- Pete --- drivers/char/joystick/ns558.c.orig Mon Nov 13 18:04:16 2000 +++ drivers/char/joystick/ns558.c Mon Nov 13 18:11:41 2000 @@ -299,37 +299,38 @@ deactivate: if (dev->deactivate) dev->deactivate(dev); return next; } #endif int __init ns558_init(void) { - int i = 0; + int i; #ifdef NSS558_ISAPNP struct pci_dev *dev = NULL; struct pnp_devid *devid; #endif /* * Probe for PCI ports. Always probe for PCI first, * it is the least-invasive probe. */ i = pci_module_init(_pci_driver); if (i == 0) have_pci_devices = 1; /* * Probe for ISA ports. */ + i = 0; while (ns558_isa_portlist[i]) ns558 = ns558_isa_probe(ns558_isa_portlist[i++], ns558); /* * Probe for PnP ports. */ #ifdef NSS558_ISAPNP for (devid = pnp_devids; devid->vendor; devid++) { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
ns558 pci/isa change
My joystick stopped working at 2.4.0-test10 due to a patch to drivers/char/joystick/ns558.c that moves pci probing ahead of isa probing. The problem is that pci_module_init can return -ENODEV into i, which is then used to index the ISA portlist. ISA probing assumes that i is initialized to zero. Trivial fix attached, with plenty of context. -- Pete --- drivers/char/joystick/ns558.c.orig Mon Nov 13 18:04:16 2000 +++ drivers/char/joystick/ns558.c Mon Nov 13 18:11:41 2000 @@ -299,37 +299,38 @@ deactivate: if (dev-deactivate) dev-deactivate(dev); return next; } #endif int __init ns558_init(void) { - int i = 0; + int i; #ifdef NSS558_ISAPNP struct pci_dev *dev = NULL; struct pnp_devid *devid; #endif /* * Probe for PCI ports. Always probe for PCI first, * it is the least-invasive probe. */ i = pci_module_init(ns558_pci_driver); if (i == 0) have_pci_devices = 1; /* * Probe for ISA ports. */ + i = 0; while (ns558_isa_portlist[i]) ns558 = ns558_isa_probe(ns558_isa_portlist[i++], ns558); /* * Probe for PnP ports. */ #ifdef NSS558_ISAPNP for (devid = pnp_devids; devid-vendor; devid++) { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/