crash in rb_insert_color()
I have got a crash on an 8-processor ia64 machine running kernel 2.6.18.8:

[a0010003cb30] ia64_fault+0x14b0/0x1600
    sp=e0019ea5f930 bsp=e0019ea51408
    r32 : 001a r33 : 00040020 r34 : 0010 r35 : 0003800f
    r36 : e0019ea5fbb8 r37 : e0019ea5fbb0 r38 : a0010027f2e0 r39 : e0019ea5f9c0
    r40 : 000b r41 : a001c4e0 r42 : 0005 r43 : a00100a90050
[a001c4e0] ia64_leave_kernel+0x0/0x280
    sp=e0019ea5fb60 bsp=e0019ea51408
[a0010027f2e0] rb_insert_color+0x60/0x380
    sp=e0019ea5fd30 bsp=e0019ea513b8
    r32 : e004acb19448 r33 : e0052a7ba100 r34 : e002799a5908 r35 :
    r36 : a002046f7810 r37 : 0690 r38 : 6f732e7365725f73 r39 : e004acb1c708
    r40 : e0052a7ba100
[a002046f7810] ext3_htree_store_dirent+0x270/0x2e0 [ext3]
    sp=e0019ea5fd30 bsp=e0019ea51350
    r32 : e0052a7ba100 r33 : 59bf4f42 r34 : 26f5c228 r35 : e0082bb9063c
    r36 : e004acb19440 r37 : e002799a5918 r38 : 003d r39 : e0082bb90642
    r40 : e004acb19444 r41 : e002799a5908 r42 : a00204708a70 r43 : 0611
    r44 : a00204716448
[a00204708a70] htree_dirblock_to_tree+0x1b0/0x280 [ext3]
    sp=e0019ea5fd30 bsp=e0019ea512f0
    r32 : e00126191780 r33 : e0082bb907f8 r34 : e0082bb9063c r35 : e0019ea5fd50
    r36 : r37 : r38 : e0014ff15b68 r39 : 0018
    r40 : e0019ea5fd54 r41 : a00204708c40 r42 : 0795 r43 : a00204716448
[a00204708c40] ext3_htree_fill_tree+0x100/0x460 [ext3]
    sp=e0019ea5fd40 bsp=e0019ea51278
    r32 : e00126191780 r33 : r34 : r35 : e0052a7ba128
    r36 : a002046f82b0 r37 : e0014ff060b8 r38 : r39 : 00d0
    r40 : 600101cc r41 : 600ffa5ff5b0 r42 : 600ffa5ff618 r43 : f000
    r44 : a002046f82f0 r45 : 0fa6 r46 : a00204716448
[a002046f82f0] ext3_readdir+0x890/0xcc0 [ext3]
    sp=e0019ea5fda0 bsp=e0019ea51178
    r32 : e00126191780 r33 : e0019ea5fe20 r34 : a00100656570 r35 : e0052a7ba108
    r36 : e0052a7ba100 r37 : e0052a7ba124 r38 : e0052a7ba120 r39 : e0014ff06160
    r40 : e00126191830 r41 : e001261917b8 r42 : e0052a7ba128 r43 : e00126191790
    r44 : e0052a7ba118 r45 : e0014ff060b8 r46 : e0052a7ba124 r47 : e0087a0b5980
    r48 : e0087a0b59a0 r49 : e0052a7ba108 r50 : r51 : r52 : r53 : 0793
    r54 : e0019ea5fe00 r55 : 0001 r56 : 0793 r57 : r58 : 2003cc7c
    r59 : a0010017bd60 r60 : 058e r61 : a00204716448 r62 : 1555a559
[a0010017bd60] vfs_readdir+0x1a0/0x200
    sp=e0019ea5fe10 bsp=e0019ea51120
    r32 : e00126191780 r33 : a00100656570 r34 : e0019ea5fe20 r35 : fffe
    r36 : e001261917a0 r37 : e0014ff060b8 r38 : e00126191790 r39 : e0014ff06178
    r40 : a0010017c540 r41 : 0915 r42 : a00100a90050

Here is the beginning of rb_insert_color():

void rb_insert_color(struct rb_node *node, struct rb_root *root)
{
	struct rb_node *parent, *gparent;

	while ((parent = rb_parent(node)) && rb_is_red(parent)) {
		gparent = rb_parent(parent);

		if (parent == gparent->rb_left) {

When it crashes, I've got the following nodes:

p/x *((struct rb_node *) 0xe004acb19448)	// node
{ rb_parent_color = 0xe002799a5908, rb_right = 0x0, rb_left = 0x0 }
p/x *((struct rb_node *) 0xe002799a5908)	// parent = rb_parent(node)
{ rb_parent_color = 0x0, rb_right = 0x0, rb_left = 0xa0010065 } // seems to be valid

Note that "node" is not necessarily the same as the argument rb_insert_color() was called with. gparent = rb_parent(parent) is NULL, and the kernel obviously traps at "if (parent == gparent->rb_left)".

Does the algorithm guarantee that if a node is red, it must have a parent? I guess yes. (I have not seen any modification in rb_insert_color() since my 2.6.18.8.) Is mutex_lock(&inode->i_mutex) in vfs_readdir() enough to protect the tree? I guess yes.

I have got no idea how this can happen. Has something been corrected since?

Regards,

Zoltan Menyhart
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
__down() fails to provide with acquisition semantics
I have got a concern about "__down()". Can someone explain to me, please, how it is supposed to work? Apparently, "__down()" fails to provide acquisition semantics in certain situations.

Let's assume the semaphore is unavailable and there is nobody waiting for it. The next requester enters the slow path. Let's assume the owner of the semaphore releases it just before the requester would execute the line:

	if (!atomic_add_negative(sleepers - 1, &sem->count)) {

Therefore the loop gets broken and the requester returns without going to sleep.

atomic_add_negative():
	atomic_add_return():
		ia64_fetch_and_add():
			ia64_fetchadd(i, v, rel)

At least on ia64, "atomic_add_negative()" provides release semantics. "remove_wait_queue_locked()" does not care about acquisition semantics. "wake_up_locked()" finds an empty list and does nothing. "spin_unlock_irqrestore()" has release semantics.

The requester is granted the semaphore and s/he enters the critical section without making sure that the memory accesses s/he wants to issue in the critical section cannot be made globally visible before "atomic_add_negative()" is. We need acquisition semantics (at least) before entering any critical section. Should we not have something like:

	if (atomic_add_acq(sleepers - 1, &sem->count) /* ret: new val */ >= 0) {

"atomic_add_acq()" would provide acquisition semantics in an architecture dependent way. I think it should be made more explicit that this routine has to provide the architecture dependent memory fencing.

What is the use of the "wake_up_locked(&sem->wait)"? The other requester, once woken up, will find the semaphore unavailable...

Another question: is there any reason to keep an ia64 version when lib/semaphore-sleepers.c and arch/ia64/kernel/semaphore.c do not really differ?
Thanks,

Zoltan Menyhart
Re: [PATCH] flush icache before set_pte take6. [4/4] optimization for cpus other than montecito
Luck, Tony wrote:
> This seems crazy to me. Flushing should occur according to the *architecture*, not model-by-model. Even if we happen to get "lucky" on pre-Montecito CPUs, that doesn't justify such ugly hacks. Or do you really want to debug this *again* come the next CPU?
> Ditto. The only reason we should ever have model specific checks should be to work around model specific errata (e.g. the McKinley Errata #9 code in patch.c).

You do have model-specific I-cache semantics. Not taking that into account will oblige you to flush in vain on the models which do not require it. Why do you want to take this option?

Thanks,

Zoltan
Re: [PATCH] flush icache before set_pte take6. [4/4] optimization for cpus other than montecito
Jim Hull wrote:
> Not just crazy, but wrong - this *can* happen on pre-Montecito. Even though the L1D is write-through and the L2 was mixed I/D, the L1 I-cache could contain stale instructions if there are missing flushes.

I cannot agree with you. In order for an L1 I-cache entry to be considered valid, a corresponding virtual -> physical address translation must be valid in one of the L1 ITLBs. See 6.1.1 "Instruction TLBs" of the Itanium 2 Processor Reference Manual for Software Development and Optimization. You cannot have a valid L1 ITLB entry unless you have a corresponding valid L2 ITLB entry. When you remove a PTE (or switch off the exec bit) and you flush the L2 ITLB entry matching the old translation (which also kills the corresponding L1 ITLB entries), you invalidate the corresponding L1 I-cache entries as well. Therefore CPU models without split L2 caches are safe.

Thanks,

Zoltan
Re: [PATCH] flush icache before set_pte take6. [4/4] optimization for cpus other than montecito
KAMEZAWA Hiroyuki wrote:
> Could we discuss this in another thread as "optimization for some cpus" and push the bug-fix patches first? If take5 or part of take6 (patches 1, 2) are not acceptable, I'll continue this work on -rc2.
> Thanks, -Kame

Sure.

Thanks,

Zoltan
Re: [PATCH] flush icache before set_pte take6. [4/4] optimization for cpus other than montecito
David Mosberger-Tang wrote:
> This seems crazy to me. Flushing should occur according to the *architecture*, not model-by-model. Even if we happen to get "lucky" on pre-Montecito CPUs, that doesn't justify such ugly hacks. Or do you really want to debug this *again* come the next CPU?
> --david

O.K., let's say we flush by default: the global flag is set. We can have a (short) list of the CPU models which do not require this flush. If all of the CPUs are on the list, then clear the global flag. And:

static inline void sync_icache_dcache(pte_t pte)
{
	if (pte_exec(pte) && global_flag)
		__sync_icache_dcache(pte);
}

Thanks,

Zoltan
Re: [PATCH] flush icache before set_pte() take 5.
KAMEZAWA Hiroyuki wrote:
> Considerations:
> - I can add CONFIG_MONTECITO if necessary. But it will be confusing, I think.

What about the trick below? identify_cpu() finds out the "c->family". If any of the CPUs has c->family == 32 (or a future version...), then set a global flag. And:

static inline void sync_icache_dcache(pte_t pte)
{
	if (pte_exec(pte) && global_flag)
		__sync_icache_dcache(pte);
}

Thanks,

Zoltan
Re: [PATCH] flush cache fixes for ia64 [2/2] sync icache dcache
Do you really need a "sync_icache_dcache()" in "do_wp_page()"? I guess it is only needed when the EXEC bit of the PTE gets turned on.

> --- linux-2.6.23-rc1.test.orig/mm/migrate.c
> +++ linux-2.6.23-rc1.test/mm/migrate.c
> @@ -172,6 +172,7 @@ static void remove_migration_pte(struct
>  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
>  	if (is_write_migration_entry(entry))
>  		pte = pte_mkwrite(pte);
> +	flush_icache_page(vma, new);
>  	set_pte_at(mm, addr, ptep, pte);
>  	if (PageAnon(new))

I think "flush_icache_page()", or rather "flush_cache_page()", is to be placed where migration tears down the old virtual --> physical address translation for good.

> --- linux-2.6.23-rc1.test.orig/mm/migrate.c
> +++ linux-2.6.23-rc1.test/mm/migrate.c
> @@ -173,6 +173,7 @@ static void remove_migration_pte(struct
>  	if (is_write_migration_entry(entry))
>  		pte = pte_mkwrite(pte);
>  	flush_icache_page(vma, new);
> +	sync_icache_dcache(pte);
>  	set_pte_at(mm, addr, ptep, pte);

Sure, I can agree to add "sync_icache_dcache()" here.

Thanks,

Zoltan
Re: [PATCH] flush icache before set_pte on ia64 take3
KAMEZAWA Hiroyuki wrote:
> Thank you for the advice. Hmm... how about this?
> ==
> /*
>  * Synchronize icache and dcache if the hardware doesn't do it
>  * automatically and the page is executable.
>  */
> static inline void arch_sync_icache_dcache(pte_t pte, struct page *page)
> {
> }

Why not? I just wanted to see whether you change the existing interface or not.

Thanks,

Zoltan
Race condition around flush_cache_page() ?
Documentation/cachetlb.txt says:

"In general, when Linux is changing an existing virtual-->physical mapping to a new value, the sequence will be ...:

	flush_cache_page(vma, addr, pfn);
	set_pte(pte_pointer, new_pte_val);
	flush_tlb_page(vma, addr);

The cache level flush will always be first, because this allows us to properly handle systems whose caches are strict and require a virtual-->physical translation to exist for a virtual address when that virtual address is flushed from the cache."

Let's assume we've got a multi-threaded application with 2 threads, each of them bound to its own CPU. Let's assume the CPU caches are tagged by virtual address and the virtual-->physical translation must remain valid while we flush the cache.

CPU #1 is in a loop that accesses the stuff on a given virtual page. CPU #2 executes the sequence above, trying to tear down the mapping of the same page. Once CPU #2 has completed flush_cache_page() (which is not a no-op here), there is no valid data of the page any more in any of the CPU caches. Due to some HW delays, CPU #2 will invalidate the PTE only somewhat later. In the meantime, CPU #1 detects a cache miss. It still has the valid translation in its TLB, so it can re-fetch the cache line from the page. Now CPU #2 can really invalidate the PTE and flush the TLBs. We can end up with a cache line in CPU #1 that has no more valid virtual-->physical translation.

As far as I can understand, there is a contradiction:
- either we keep the translation valid while flushing, and the other CPU can re-fetch the data,
- or we nuke the PTE first, and the flush won't work.

Can someone please explain to me how it is supposed to work?

Thanks,

Zoltan Menyhart
Re: [PATCH] flush icache before set_pte on ia64 take3
I do not think changing the semantics of flush_cache_page() is a good idea. Please have a look at Documentation/cachetlb.txt. It is for virtual address tagged caches.

The doc does not say much about flush_icache_page(). It is used in:

install_page(): can release any previously existing mapping (similarly: for a virtual address tagged I-cache)

On the other hand, I do not see why it is used in these two cases; I think a virtual address tagged I-cache must have been flushed when the old PTE of the page was removed:

do_no_page():
	if (pte_none(*page_table)) {
		flush_icache_page(vma, new_page);

do_swap_page():
	flush_icache_page(vma, page);

Anyway, it is a no-op on machines with physical address tagged caches.

Sure, the names could have been chosen more carefully. Unless you have a very strong reason, do not change their semantics (that would require updates for the other architectures).

I can agree:
- lazy_mmu_prot_update() is not the right place for a flush,
- the I-cache should be flushed before inserting the PTE (at least on machines with physical address tagged caches).

Can we have a separate, specific exec_flush_phtag_icache_page() that is to be called before inserting a PTE with the exec bit on (for physical address tagged I-caches)? It would be defined in asm-generic/cacheflush.h as a no-op. Or exec_flush_icache_page_before_set_pte()?

Thanks,

Zoltan
Re: [BUGFIX]{PATCH] flush icache on ia64 take2
KAMEZAWA Hiroyuki wrote:
> Hmm... but the current code flushes the page, it just does it in a "lazy" way. Much difference?

I agree that the current code flushes the I-cache for all kinds of file systems (for PTEs with the exec bit on). The error is that it does it after the PTE is written. In addition, I wanted to optimize it to gain a few %. Apparently this idea is not much welcome. I can agree that flushing the I-cache (if the architecture requires it) before setting the PTE eliminates the error.

Thanks,

Zoltan
Re: [BUGFIX][PATCH] flush icache on ia64 take2
KAMEZAWA Hiroyuki wrote:
> A bit new idea. How about this ?
> ==
> - Set PG_arch_1 if "icache is *not* coherent"

page-flags.h says:

 * PG_arch_1 is an architecture specific page state bit. The generic code
 * guarantees that this bit is cleared for a page when it first is entered into
 * the page cache.

I do not think you can easily change that. I can agree, making nfs_readpage() call an architecture dependent service is not easy either. :-)

> - make flush_dcache_page() to be empty func.
> - For Montecito, add kmap_atomic(). This function just set PG_arch1.

kmap_atomic() is used in several places. Do you want to set PG_arch_1 everywhere kmap_atomic() is called?

> Then, "the page which is copied by the kernel" is marked as "not icache coherent page"
> - icache_flush_page() just flushes a page which has PG_arch_1.
> - Anonymous page always has PG_arch_1. Take care of Copy-On-Write.

You can allocate (even in user mode) an anonymous page, hand-create or read() in some code from a file, and mprotect(, PROT_EXEC) it. The page has to become I-cache coherent.

I am not sure I really understand your proposal. I cannot see how compatibility with the existing code is ensured.

Thanks,

Zoltan
Re: [BUGFIX][PATCH] flush icache on ia64 take2
KAMEZAWA Hiroyuki wrote:
> But is it too costly that flushing icache page only if a page is newly installed into the system (PG_arch1) && it is mapped as executable ?

Well, it was a while ago; I measured on a Tiger box with CPUs of 1.3 GHz. Flushing a page of 64 Kbytes, with modified data in the D-cache (it is slower than not having modified data in the D-cache): 13.1 ... 14.7 usec.

You may have quicker machines, but having more CPUs or a NUMA architecture can slow it down considerably:
- more CPUs have to agree that that's the moment to carry out a flush
- NUMA adds delay

We may have, say, 1 Gbyte/sec local I/O activity (using some RAIDs). Assume a few % of this 1 Gbyte is program execution, or program swap-in. That gives some hundreds of new exec pages/sec => some msecs can be lost each second. I can agree that it should not be a big deal :-)

> I don't want to leak this (stupid) corner case to the file system layer. Hmm...can't we do clever flushing (like your idea) in VM layer ?

As the VM layer is designed to be independent of how the page is read in...

Thanks,

Zoltan
Re: [BUGFIX][PATCH] flush icache on ia64 take2
> Back to do_no_page(): if the new PTE includes the exec bit and PG_arch_1 is set, the page has to be flushed from the I-cache before the PTE is made globally visible.

Sorry, I wanted to say: if the new PTE includes the exec bit and PG_arch_1 is NOT set.

Thanks,

Zoltan
Re: [BUGFIX][PATCH] flush icache on ia64 take2
KAMEZAWA Hiroyuki wrote:
> Then, what should I do more for fixing this SIGILL problem ?
> -Kame

I can think of a relatively cheap solution.

New pages are allocated with the bit PG_arch_1 off, see page_cache_read() ... prep_new_page(), i.e. the I-cache is not coherent with the D-cache.

page_cache_read() should add a macro, say:

	ARCH_PREP_PAGE_BEFORE_READ(page);

before actually calling mapping->a_ops->readpage(file, page). This macro can be, for ia64, something like:

	do {
		if (CPU_has_split_L2_I_cache)
			set_bit(PG_arch_1, &page->flags);
	} while (0)

and empty for the architectures not concerned.

The file systems which are identified not to use HW means to avoid split I-cache incoherency, e.g. nfs_readpage(), are required to add a macro, say:

	ARCH_CHECK_READ_PAGE_COHERENCY(page);

This macro can be, for ia64:

	do {
		if (CPU_has_split_L2_I_cache)
			clear_bit(PG_arch_1, &page->flags);
	} while (0)

and empty for the architectures not concerned.

Back to do_no_page(): if the new PTE includes the exec bit and PG_arch_1 is set, the page has to be flushed from the I-cache before the PTE is made globally visible.

File systems using local block devices with DMA are considered to be safe, with the exception of the bounce buffers. When you copy into the destination page, another macro should be added, say:

	ARCH_CHECK_BOUNCE_READ_COHERENCY(bio_vec);

	#define ARCH_CHECK_BOUNCE_READ_COHERENCY(bio_vec) \
		ARCH_CHECK_READ_PAGE_COHERENCY(bio_vec->bv_page)

Remote-DMA based network file systems, e.g. Lustre on Quadrics or Infiniband, are also considered to be safe.

Thanks,

Zoltan
Re: [BUGFIX][PATCH] flush icache on ia64 take2
KAMEZAWA Hiroyuki wrote:
> Note1: icache flush is called only when VM_EXEC flag is on and PG_arch_1 is not set.

If you have not got the page in the cache, then the new page will be allocated with the PG_arch_1 bit off. You are going to flush pages which are read by HW DMA, i.e. pages for which the L2I of Montecito does not keep old lines anyway.

...->a_ops->readpage() of "L2I safe" file systems should set PG_arch_1 if the CPU is an ia64 with a separate L2I. On the other hand, architecture independent file systems should not play with PG_arch_1; the base kernel should export a macro for the file systems...

Thanks,

Zoltan Menyhart
Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path
KAMEZAWA Hiroyuki wrote:
> On Wed, 04 Jul 2007 16:24:38 +0200 Zoltan Menyhart <[EMAIL PROTECTED]> wrote:
>> Machines start up with bit 5 = 0; reading instruction pages via NFS has to flush them from the L2I.
> In our test, we confirmed that this can be fixed by flushing L2I just before SetPageUptodate() in NFS.

I can agree. We can be more permissive: it can be done anywhere after the new data is put in place and before nfs_readpage() or nfs_readpages() returns.

I saw your patch http://marc.info/?l=linux-mm&m=118352909826277&w=2 that modifies e.g. mm/memory.c and not the NFS layer. Have you proposed a patch against the NFS layer?

I was wondering if, instead of modifying do_no_page() and Co., we should not make nfs_readpage() be DMA-like? (No possible regression for most of the page I/Os.) I.e. it should be the responsibility of a file system to make sure it supports instruction pages correctly. The base kernel should provide such file systems with an architecture dependent macro...

> IMHO, for example, race in copy-on-write (was fixed by Tony Luck) has to be fixed by MemoryManagement layer.

I can agree. Note that it has not got much performance impact from our point of view, because there are not too many COW pages with executable code...

> And only a race in do_no_page() seems to be able to be fixed by FS layer.

If it were do_no_page() that fixed the problem, it would have to know where the page comes from in order not to flush the I-cache in vain. do_no_page() just calls vma->vm_ops->nopage(), which goes down to fs_op->readpage() / fs_op->readpages(). On the other hand, these latter routines do not know why they fetch the page(s). Note that a page can be mapped more than once, with different permission bits. As a consequence, these routines would flush the I-cache in vain in most cases. I prefer a performance impact on some non-DMA based file systems and adding nothing to the path for the majority of cases.

> BTW, can we know whether a page is filled by DMA or not ?

Well, the file systems based on block devices use DMA transfer (with the exception of the bounce buffers; in that case it is the responsibility of the bounce buffering layer to flush the I-cache). Network based file systems require revision and update...

Note that only the processors with a separate L2I require attention; the L1I is guaranteed to be flushed when you change the corresponding TLB entry. The base kernel should provide a macro / service (that cares for the processor model) for the file systems.

Thanks,

Zoltan Menyhart
Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path
Could you please confirm that I understand correctly what is in the Dual-Core Update to the Intel Itanium 2 Processor Reference Manual:

"2.3.3.2 L2 Caches ... Any coherence request to identify whether a cache line is in the processor will invalidate that line from the L2I cache."

This makes sure that the DMAs invalidate the L2I cache.

"2.7.4 Instruction Cache Coherence Optimization
Coherence requests of the L1I and L2I caches will invalidate the line if it is in the cache. Montecito allows instruction requests on the system interface to be filtered such that they will not initiate coherence requests of the L1I and L2I caches. This will allow instructions to be cached at the L1I and L2I levels across multiple processors in a coherent domain. This optimization is enabled by default, but may be disabled by PAL_SET_PROC_FEATURES bit 5 of the Montecito feature_set (18)."

Machines start up with bit 5 = 0; reading instruction pages via NFS has to flush them from the L2I.

I was wondering if, instead of modifying do_no_page() and Co., we should not make nfs_readpage() be DMA-like? (No possible regression for most of the page I/Os.) I.e. it should be the responsibility of a file system to make sure it supports instruction pages correctly. The base kernel should provide such file systems with an architecture dependent macro...

Thanks,

Zoltan Menyhart
copy_one_pte()
I had a look at copy_one_pte(). I cannot see any ioproc_update_page() call, not even for the COW pages. Is that intentional?

We can live with a COW page for a considerably long time. How could the IOPROC know that a process-ID / user virtual address pair refers to the same page? The comment above ioproc_update_page() says it is called every time a PTE is created / modified...

Thanks,

Zoltan Menyhart
Re: Mprotect needs arch hook for updated PTE settings
An application should not change the protection of its _own_ text region without knowing well the requirements of the given architecture. I do not think we should add anything into the kernel for this case.

I did see /lib/ld-linux-ia64.so.* changing the protection of the text segment of the _victim_ application when it linked the library references. ld-linux-ia64.so.* changes the protection for the whole text segment (otherwise, as the protection is per VMA, it would result in VMA fragmentation). The text segment can be huge. There is no reason to flush the whole text segment every time ld-linux-ia64.so.* patches an instruction and changes the protection.

I think the solution should consist of these two measures:

1. Let's say that if a VMA is "executable", then it remains "executable" for ever, i.e. mprotect() keeps the PROT_EXEC bit. As a result, if a page is faulted in for this VMA in the meantime, the good old mechanism makes sure that the I-caches are flushed.

2. Let's modify ld-linux-<arch>.so.*: having patched an instruction, it should take the appropriate, architecture dependent measure, e.g. for ia64, it should issue an "fc" instruction. (Who cares about a debugger? It should know what it does ;-).)

I think there is no need for any extra flushes.

Thanks,

Zoltan Menyhart