crash in rb_insert_color()

2008-02-01 Thread Zoltan Menyhart

I have got a crash on an 8 processor ia64 machine with the kernel 2.6.18.8:

[a0010003cb30] ia64_fault+0x14b0/0x1600
   sp=e0019ea5f930 bsp=e0019ea51408
   r32 : 001a r33 : 00040020 r34 : 0010
   r35 : 0003800f r36 : e0019ea5fbb8 r37 : e0019ea5fbb0
   r38 : a0010027f2e0 r39 : e0019ea5f9c0 r40 : 000b
   r41 : a001c4e0 r42 : 0005 r43 : a00100a90050
[a001c4e0] ia64_leave_kernel+0x0/0x280
   sp=e0019ea5fb60 bsp=e0019ea51408
[a0010027f2e0] rb_insert_color+0x60/0x380
   sp=e0019ea5fd30 bsp=e0019ea513b8
   r32 : e004acb19448 r33 : e0052a7ba100 r34 : e002799a5908
   r35 :  r36 : a002046f7810 r37 : 0690
   r38 : 6f732e7365725f73 r39 : e004acb1c708 r40 : e0052a7ba100
[a002046f7810] ext3_htree_store_dirent+0x270/0x2e0 [ext3]
   sp=e0019ea5fd30 bsp=e0019ea51350
   r32 : e0052a7ba100 r33 : 59bf4f42 r34 : 26f5c228
   r35 : e0082bb9063c r36 : e004acb19440 r37 : e002799a5918
   r38 : 003d r39 : e0082bb90642 r40 : e004acb19444
   r41 : e002799a5908 r42 : a00204708a70 r43 : 0611
   r44 : a00204716448
[a00204708a70] htree_dirblock_to_tree+0x1b0/0x280 [ext3]
   sp=e0019ea5fd30 bsp=e0019ea512f0
   r32 : e00126191780 r33 : e0082bb907f8 r34 : e0082bb9063c
   r35 : e0019ea5fd50 r36 :  r37 : 
   r38 : e0014ff15b68 r39 : 0018 r40 : e0019ea5fd54
   r41 : a00204708c40 r42 : 0795 r43 : a00204716448
[a00204708c40] ext3_htree_fill_tree+0x100/0x460 [ext3]
   sp=e0019ea5fd40 bsp=e0019ea51278
   r32 : e00126191780 r33 :  r34 : 
   r35 : e0052a7ba128 r36 : a002046f82b0 r37 : e0014ff060b8
   r38 :  r39 : 00d0 r40 : 600101cc
   r41 : 600ffa5ff5b0 r42 : 600ffa5ff618 r43 : f000
   r44 : a002046f82f0 r45 : 0fa6 r46 : a00204716448
[a002046f82f0] ext3_readdir+0x890/0xcc0 [ext3]
   sp=e0019ea5fda0 bsp=e0019ea51178
   r32 : e00126191780 r33 : e0019ea5fe20 r34 : a00100656570
   r35 : e0052a7ba108 r36 : e0052a7ba100 r37 : e0052a7ba124
   r38 : e0052a7ba120 r39 : e0014ff06160 r40 : e00126191830
   r41 : e001261917b8 r42 : e0052a7ba128 r43 : e00126191790
   r44 : e0052a7ba118 r45 : e0014ff060b8 r46 : e0052a7ba124
   r47 : e0087a0b5980 r48 : e0087a0b59a0 r49 : e0052a7ba108
   r50 :  r51 :  r52 : 
   r53 : 0793 r54 : e0019ea5fe00 r55 : 0001
   r56 : 0793 r57 :  r58 : 2003cc7c
   r59 : a0010017bd60 r60 : 058e r61 : a00204716448
   r62 : 1555a559
[a0010017bd60] vfs_readdir+0x1a0/0x200
   sp=e0019ea5fe10 bsp=e0019ea51120
   r32 : e00126191780 r33 : a00100656570 r34 : e0019ea5fe20
   r35 : fffe r36 : e001261917a0 r37 : e0014ff060b8
   r38 : e00126191790 r39 : e0014ff06178 r40 : a0010017c540
   r41 : 0915 r42 : a00100a90050

Here is the beginning of rb_insert_color():

void rb_insert_color(struct rb_node *node, struct rb_root *root)
{
   struct rb_node *parent, *gparent;

	while ((parent = rb_parent(node)) && rb_is_red(parent)) {
		gparent = rb_parent(parent);
		if (parent == gparent->rb_left) {

When it crashes, I've got the following nodes:

p/x *((struct rb_node *) 0xe004acb19448)// node
{
 rb_parent_color = 0xe002799a5908,
 rb_right = 0x0,
 rb_left = 0x0
}
p/x *((struct rb_node *) 0xe002799a5908)// parent = rb_parent(node)
{
 rb_parent_color = 0x0,
 rb_right = 0x0,
 rb_left = 0xa0010065   // seems to be valid
}

Note that "node" is not necessarily the same as the argument that
rb_insert_color() was originally called with; the loop walks up the tree.

gparent = rb_parent(parent) is NULL, and the kernel obviously traps
at if (parent == gparent->rb_left).
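
For reference, this is roughly how rb_parent() and rb_is_red() decode
rb_parent_color in include/linux/rbtree.h of that era (quoted from memory,
so take it as a sketch rather than the exact source):

	#define RB_RED		0
	#define RB_BLACK	1

	/* the parent pointer and the colour share one word */
	#define rb_parent(r)	((struct rb_node *)((r)->rb_parent_color & ~3))
	#define rb_color(r)	((r)->rb_parent_color & 1)
	#define rb_is_red(r)	(!rb_color(r))
	#define rb_is_black(r)	rb_color(r)

So rb_parent_color == 0x0, as dumped above for "parent", decodes as
parent == NULL and colour == RB_RED. That is the state of a freshly linked
root node before rb_insert_color() has painted it black, which is exactly
the combination that lets the loop dereference a NULL gparent.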

Does the algorithm guarantee that if a node is red, it must have a parent?
I guess yes.
(I have not seen any modification in rb_insert_color() since my 2.6.18.8.)

Is mutex_lock(&inode->i_mutex) in vfs_readdir() enough to protect the tree?
I guess yes.

I have no idea how this can happen.
Has something been fixed in this area since?

Regards,

Zoltan Menyhart



__down() fails to provide with acquisition semantics

2007-09-14 Thread Zoltan Menyhart

I have a concern about "__down()".
Can someone please explain to me how it is supposed to work?

Apparently, "__down()" fails to provide acquire semantics
in certain situations.

Let's assume the semaphore is unavailable and there is nobody
waiting for it.

The next requester enters the slow path.

Let's assume the owner of the semaphore releases it just before
the requester would execute the line:

	if (!atomic_add_negative(sleepers - 1, &sem->count)) {

Therefore the loop gets broken, the requester returns without
going to sleep.

  atomic_add_negative():
atomic_add_return():
ia64_fetch_and_add():
ia64_fetchadd(i, v, rel)

At least on ia64, "atomic_add_negative()" provides release
semantics only.

"remove_wait_queue_locked()" does not provide acquire semantics either.
"wake_up_locked()" finds an empty list, so it does nothing.
"spin_unlock_irqrestore()" has release semantics.

The requester is granted the semaphore and enters the critical
section without any guarantee that the memory accesses it issues
inside the critical section cannot become globally visible before
"atomic_add_negative()" does.

We need (at least) acquire semantics before entering any
critical section. Shouldn't we have something like:

	if (atomic_add_acq(sleepers - 1, &sem->count) /* ret: new val */ >= 0) {

"atomic_add_acq()" would provide acquire semantics in
an architecture dependent way.
I think it should be made explicit that this routine is responsible
for the architecture dependent memory fencing.
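
For example (an untested sketch only -- atomic_add_acq() does not exist in
the kernel; cmpxchg_acq() is used here merely because it is an existing
ia64 primitive that already has acquire semantics):

	static inline int atomic_add_acq(int i, atomic_t *v)
	{
		int old, new;

		/* cmpxchg_acq() provides the acquire fence on ia64 */
		do {
			old = atomic_read(v);
			new = old + i;
		} while (cmpxchg_acq(&v->counter, old, new) != old);
		return new;		/* the new value, as assumed above */
	}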


What is the use of the "wake_up_locked(&sem->wait)"?
Any other requester that is woken up will just find the semaphore unavailable...


Another question: is there any reason keeping an ia64 version when
lib/semaphore-sleepers.c and arch/ia64/kernel/semaphore.c do not
really differ?


Thanks,

Zoltan Menyhart


Re: [PATCH] flush icache before set_pte take6. [4/4] optimization for cpus other than montecito

2007-08-01 Thread Zoltan Menyhart

Luck, Tony wrote:

This seems crazy to me.  Flushing should occur according to the
*architecture*, not model-by-model.  Even if we happen to get "lucky"
on pre-Montecito CPUs, that doesn't justify such ugly hacks.  Or you
really want to debug this *again* come next CPU?



Ditto.  The only reason we should ever have model specific checks should
be to work around model specific errata (e.g. the McKinley Errata #9 code
in patch.c).


The I-cache semantics really are model specific.
Not taking that into account obliges you to flush in vain on the models
which do not require it. Why would you want to accept that?


Thanks,

Zoltan


Re: [PATCH] flush icache before set_pte take6. [4/4] optimization for cpus other than montecito

2007-08-01 Thread Zoltan Menyhart

Jim Hull wrote:

Not just crazy, but wrong - this *can* happen on pre-Montecito.  Even though
L1D is write-through and L2 was mixed I/D, the L1 I-cache could contain
stale instructions if there are missing flushes.


I cannot agree with you.

In order to consider an L1 I-cache entry as valid, a corresponding
virtual -> physical address translation has to be valid in one of the L1 ITLBs.

See "6.1.1 Instruction TLBs" of the I2 Proc. Ref. Man. for SW Dev. & Opt.

You cannot have a valid L1 ITLB entry unless you have a corresponding valid
L2 ITLB entry.

When you remove a PTE (or switch off the exec bit) and you flush the L2 ITLB
matching the old translation (and you kill the corresponding L1 ITLBs),
you do invalidate the corresponding L1 I-cache entries.

Therefore CPU models without split L2 caches are safe.

Thanks,

Zoltan


Re: [PATCH] flush icache before set_pte take6. [4/4] optimization for cpus other than montecito

2007-07-31 Thread Zoltan Menyhart

KAMEZAWA Hiroyuki wrote:


Could we discuss this in other thread as "optimization for some cpus" and
push bug-fix patches first ?

If take5 or part of take6(patch 1,2) are not acceptable, I'll continue this
work on -rc2.

Thanks,
-Kame


Sure.

Thanks,

Zoltan


Re: [PATCH] flush icache before set_pte take6. [4/4] optimization for cpus other than montecito

2007-07-31 Thread Zoltan Menyhart

David Mosberger-Tang wrote:

This seems crazy to me.  Flushing should occur according to the
*architecture*, not model-by-model.  Even if we happen to get "lucky"
on pre-Montecito CPUs, that doesn't justify such ugly hacks.  Or you
really want to debug this *again* come next CPU?

  --david


O.K. let's say we flush by default: the global flag is set.

We can have a (short) list of the CPU models which do not require
this flush.

If all of the CPUs are on the list then clear the global flag. And:

static inline void sync_icache_dcache(pte_t pte) {
  if (pte_exec(pte) && global_flag)
  __sync_icache_dcache(pte);
}

Thanks,

Zoltan


Re: [PATCH] flush icache before set_pte() take 5.

2007-07-30 Thread Zoltan Menyhart

KAMEZAWA Hiroyuki wrote:


Considerations:
  - I can add CONFIG_MONTECITO if necessary. But it will be confusing, I think.


What about this trick below?

identify_cpu() finds out "c->family".
If any of the CPUs has c->family == 32 (or a future model...), then
set a global flag. And:

static inline void sync_icache_dcache(pte_t pte) {
   if (pte_exec(pte) && global_flag)
   __sync_icache_dcache(pte);
}
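
To make the boot-time side concrete, a rough, untested sketch (the flag and
helper names are made up for illustration; family 32 == 0x20 is the
dual-core Itanium 2):

	int ia64_icache_needs_sync;	/* the "global flag" above, set once at boot */

	static void __cpuinit note_icache_model(struct cpuinfo_ia64 *c)
	{
		if (c->family == 0x20)		/* 32: separate L2I */
			ia64_icache_needs_sync = 1;
	}

identify_cpu() would call it for every CPU brought up, so
sync_icache_dcache() only pays the cost of testing a single global.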

Thanks,

Zoltan


Re: [PATCH] flush cache fixes for ia64 [2/2] sync icache dcache

2007-07-27 Thread Zoltan Menyhart

Do you really need a "sync_icache_dcache()" in "do_wp_page()"?
I guess it is only needed when the EXEC bit of the PTE gets turned on.


--- linux-2.6.23-rc1.test.orig/mm/migrate.c
+++ linux-2.6.23-rc1.test/mm/migrate.c
@@ -172,6 +172,7 @@ static void remove_migration_pte(struct 
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));

 	if (is_write_migration_entry(entry))
 		pte = pte_mkwrite(pte);
+	flush_icache_page(vma, new);
 	set_pte_at(mm, addr, ptep, pte);
 
 	if (PageAnon(new))


I think "flush_icache_page()", or rather "flush_cache_page()" is
to be placed where migration tears down the old virtual --> physical
address translation for good.


--- linux-2.6.23-rc1.test.orig/mm/migrate.c
+++ linux-2.6.23-rc1.test/mm/migrate.c
@@ -173,6 +173,7 @@ static void remove_migration_pte(struct 
 	if (is_write_migration_entry(entry))
 		pte = pte_mkwrite(pte);
 	flush_icache_page(vma, new);
+	sync_icache_dcache(pte);
 	set_pte_at(mm, addr, ptep, pte);


Sure, I can agree to add "sync_icache_dcache()" here.

Thanks,

Zoltan


Re: [PATCH] flush icache before set_pte on ia64 take3

2007-07-26 Thread Zoltan Menyhart

KAMEZAWA Hiroyuki wrote:


Thank you for advise. Hmm..how about this ?
==
 /*
  * synchronize icache and dcache if the hardware doesn't do it automatically
  * and the page is executable.
  */
static inline arch_sync_icache_dcache(pte_t pte, page) {

}


Why not?

I just wanted to see if you change the existing interface or not.
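
Something along these lines is what I would expect for the ia64 side
(untested sketch; it relies only on existing primitives, with the
PG_arch_1 convention being "set == I-cache already coherent"):

	static inline void arch_sync_icache_dcache(pte_t pte, struct page *page)
	{
		unsigned long addr;

		if (!pte_exec(pte) || test_bit(PG_arch_1, &page->flags))
			return;			/* not executable, or already coherent */

		addr = (unsigned long) page_address(page);
		flush_icache_range(addr, addr + PAGE_SIZE);
		set_bit(PG_arch_1, &page->flags);
	}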

Thanks,

Zoltan


Race condition around flush_cache_page() ?

2007-07-26 Thread Zoltan Menyhart

Documentation/cachetlb.txt says:

"In general, when Linux is changing an existing virtual-->physical
mapping to a new value, the sequence will be ...:

  flush_cache_page(vma, addr, pfn);
  set_pte(pte_pointer, new_pte_val);
  flush_tlb_page(vma, addr);

The cache level flush will always be first, because this allows
us to properly handle systems whose caches are strict and require
a virtual-->physical translation to exist for a virtual address
when that virtual address is flushed from the cache."

Let's assume we've got a multi-threaded application with two threads,
each of them bound to its own CPU.
Let's assume the CPU caches are tagged by virtual address and the
virtual-->physical translation must remain valid while we flush the
cache.

CPU #1 is in a loop that accesses data on a given virtual page.

CPU #2 executes the sequence above, trying to tear down the mapping
of the same page. Once CPU #2 has completed flush_cache_page()
(which is not a NO-OP here), no line of the page is left in any of
the CPU caches. Due to some HW delays, CPU #2 only gets around to
invalidating the PTE somewhat later.

In the meantime, CPU #1 takes a cache miss. It still has the valid
translation in its TLB, so it can re-fetch the cache line from the page.

Now CPU #2 finally invalidates the PTE and flushes the TLBs.

We end up with a cache line on CPU #1 that no longer has a valid
virtual-->physical translation.

As far as I can see, there is a contradiction:
- either we keep the translation valid while flushing, and
 the other CPU can re-fetch the data
- or we nuke the PTE first, and the flush won't work

Can someone please explain to me how this is supposed to work?

Thanks,

Zoltan Menyhart



Re: [PATCH] flush icache before set_pte on ia64 take3

2007-07-26 Thread Zoltan Menyhart

I do not think changing the semantics of flush_cache_page()
is a good idea. Please have a look at Documentation/cachetlb.txt.
This is for virtual address tagged caches.

The doc. does not say much about flush_icache_page().
It is used in:

install_page(): can release any previously existing mapping
(similarly: for virtual address tagged I-cache)

On the other hand, I do not see why it is used in these two
cases; I think the virtual address tagged I-cache must already have been
flushed when the old PTE of the page was removed:

do_no_page():
if (pte_none(*page_table)) {
flush_icache_page(vma, new_page);

do_swap_page():
flush_icache_page(vma, page);

Anyway, it is a NO-OP on the machines with physical address
tagged caches.

Sure, the names could have been chosen more carefully.

Unless you have a very strong reason, do not change their
semantics (would require updates for the other architectures).

I can agree:
- lazy_mmu_prot_update() is not the right place for a flush
- to flush I-cache before inserting PTE (at least for the
 machines with physical address tagged caches)

Can we have a separate, specific exec_flush_phtag_icache_page()
that is to be called before inserting a PTE with the exec bit on
(for physical address tagged I-caches)?
It would be defined in asm-generic/cacheflush.h as a NO-OP.

Or exec_flush_icache_page_before_set_pte() ?
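
A minimal sketch of what I have in mind (the names are only proposals,
nothing of this exists today):

	/* asm-generic/cacheflush.h: default, the HW keeps the I-cache coherent */
	#define exec_flush_phtag_icache_page(vma, page)	do { } while (0)

	/* asm-ia64/cacheflush.h: override for CPUs with a separate L2I */
	#define exec_flush_phtag_icache_page(vma, page)				\
		do {								\
			unsigned long _a = (unsigned long) page_address(page);	\
			flush_icache_range(_a, _a + PAGE_SIZE);			\
		} while (0)

The callers would invoke it just before set_pte() whenever the new PTE is
executable.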

Thanks,

Zoltan



Re: [BUGFIX]{PATCH] flush icache on ia64 take2

2007-07-19 Thread Zoltan Menyhart

KAMEZAWA Hiroyuki wrote:


Hmm...but the current code flushes the page. just do it in "lazy" way.
much difference ?


I agree the current code flushes the I-cache for all kinds of file
systems (for PTEs with the exec bit on).

The error is that it does it after the PTE is written.

In addition, I wanted to optimize it, to gain a few %.
Apparently this idea is not very welcome.

I can agree that flushing the I-cache (if the architecture requires it)
before setting the PTE eliminates the error.

Thanks,

Zoltan


Re: [BUGFIX]{PATCH] flush icache on ia64 take2

2007-07-19 Thread Zoltan Menyhart

KAMEZAWA Hiroyuki wrote:


A bit new idea.  How about this ?
==
- Set PG_arch_1 if  "icache is *not* coherent"


page-flags.h:
* PG_arch_1 is an architecture specific page state bit.  The generic code
* guarantees that this bit is cleared for a page when it first is entered into
* the page cache.

I do not think you can easily change it.
I can agree that making nfs_readpage() call an architecture dependent service
is not easy either. :-)


- make flush_dcache_page() to be empty func.
- For Montecito, add kmap_atomic(). This function just set PG_arch1.


kmap_atomic() is used in several places. Do you want to set
PG_arch_1 everywhere kmap_atomic() is called?


  Then, "the page which is copied by the kernel" is marked as "not icache coherent 
page"
- icache_flush_page() just flushes a page which has PG_arch_1.
- Anonymous page is always has PG_arch_1. Tkae care of Copy-On-Write.


You can allocate (even in user mode) an anonymous page, hand-craft or
read() in some code from a file, and mprotect() it to PROT_EXEC. The page then
has to become I-cache coherent.

I am not sure I really understand your proposal.
I cannot see how compatibility with the existing code is ensured.

Thanks,

Zoltan


Re: [BUGFIX]{PATCH] flush icache on ia64 take2

2007-07-19 Thread Zoltan Menyhart

KAMEZAWA Hiroyuki wrote:


But is it too costly that flushing icache page only if a page is newly
installed into the system (PG_arch1) && it is mapped as executable ?


Well, it was a while ago; I measured on a Tiger box with
1.3 GHz CPUs:

Flushing a page of 64 Kbytes, with modified data in the D-cache
(that is slower than not having modified data in the D-cache):

13.1 ... 14.7 usec.

You may have quicker machines, but having more CPUs or a NUMA architecture
can slow it down considerably:
- more CPUs have to agree that that's the moment to carry out a flush
- NUMA adds delay

We may have, say, 1 Gbyte/sec of local I/O activity (using some RAIDs).
Assume a few % of this 1 Gbyte is program text being executed or swapped in.
That gives some hundreds of new executable pages per second =>
some milliseconds can be lost every second.
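
(As a worked example, assuming 3% of the 1 Gbyte/sec is executable text:
that is about 30 Mbytes/sec, i.e. roughly 480 pages of 64 Kbytes per second;
at ~14 usec per flush that costs about 6.7 msec out of every second.)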

I can agree that it should not be a big deal :-)


I don't want to leak this (stupid) corner case to the file system layer.
Hmm...can't we do clever flushing (like your idea) in VM layer ?


As the VM layer is designed to be independent of how the pages are read in...

Thanks,

Zoltan


Re: [BUGFIX]{PATCH] flush icache on ia64 take2

2007-07-19 Thread Zoltan Menyhart

Back to do_no_page():
if the new PTE includes the exec bit and PG_arch_1 is set,
the page has to be flushed from the I-cache before the PTE is
made globally visible.


Sorry, I wanted to say:

if the new PTE includes the exec bit and PG_arch_1 is NOT set

Thanks,

Zoltan



Re: [BUGFIX]{PATCH] flush icache on ia64 take2

2007-07-19 Thread Zoltan Menyhart

KAMEZAWA Hiroyuki wrote:


Then, what should I do more for fixing this SIGILL problem ?

-Kame


I can think of a relatively cheap solution:

New pages are allocated with the bit PG_arch_1 off, see
page_cache_read() ... prep_new_page(), i.e. the I-cache is
not coherent with the D-cache.

page_cache_read() should add a macro, say:

ARCH_PREP_PAGE_BEFORE_READ(page);

before actually calling mapping->a_ops->readpage(file, page).

This macro can be for ia64 something like:

do { if (CPU_has_split_L2_I_cache) set_bit(PG_arch_1, &page->flags); } while (0)

and empty for the architectures not concerned.

The file systems which are identified not to use HW tools to
avoid split I-cache incoherency, e.g. nfs_readpage(), are required
to add a macro, say:

ARCH_CHECK_READ_PAGE_COHERENCY(page);

This macro can be for ia64:

do { if (CPU_has_split_L2_I_cache) clear_bit(PG_arch_1, &page->flags); } while (0)

and empty for the architectures not concerned.

Back to do_no_page():
if the new PTE includes the exec bit and PG_arch_1 is set,
the page has to be flushed from the I-cache before the PTE is
made globally visible.
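
In do_no_page() terms that would be roughly the following (untested sketch,
using its local variable names from memory; note the correction elsewhere in
this thread: the flush is needed when PG_arch_1 is *not* set):

	if ((vma->vm_flags & VM_EXEC) &&
	    !test_bit(PG_arch_1, &new_page->flags)) {
		unsigned long addr = (unsigned long) page_address(new_page);

		flush_icache_range(addr, addr + PAGE_SIZE);
		set_bit(PG_arch_1, &new_page->flags);	/* coherent from now on */
	}
	set_pte_at(mm, address, page_table, entry);	/* only now publish the PTE */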

File systems using local block devices with DMA are considered
to be safe, with the exceptions of the bounce buffers.

When you copy into the destination page, another macro should be
added, say:

ARCH_CHECK_BOUNCE_READ_COHERENCY(bio_vec);

#define ARCH_CHECK_BOUNCE_READ_COHERENCY(bio_vec) \
ARCH_CHECK_READ_PAGE_COHERENCY(bio_vec->bv_page)

Remote DMA based network file systems, e.g. Lustre on Quadrics,
Infiniband are also considered to be safe.

Thanks,

Zoltan


Re: [BUGFIX]{PATCH] flush icache on ia64 take2

2007-07-06 Thread Zoltan Menyhart

KAMEZAWA Hiroyuki wrote:

 Note1: icache flush is called only when VM_EXEC flag is on and 
PG_arch_1 is not set.


If you have not got the page in the cache, then the new page will
be allocated with the PG_arch_1 bit off.
You are then going to flush pages which are read by HW DMA, although the
Montecito L2I does not keep old lines for those pages anyway.

...->a_ops->readpage() of "L2I safe" file systems should set PG_arch_1
if the CPU is ia64 and it has a separate L2I.

On the other hand, arch. independent file systems should not play with
PG_arch_1.
The base kernel should export a macro for the file systems...

Thanks,

Zoltan Menyhart




Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path

2007-07-05 Thread Zoltan Menyhart

KAMEZAWA Hiroyuki wrote:

On Wed, 04 Jul 2007 16:24:38 +0200
Zoltan Menyhart <[EMAIL PROTECTED]> wrote:


Machines start up with bit 5 = 0, so reading instruction pages via
NFS has to flush them from the L2I.



In our test, we confirmed that this can be fixed by flushing L2I just before 
SetPageUptodate() in NFS.


I can agree.
We can be more permissive: it can be done anywhere after the new
data is put in place and before nfs_readpage() or nfs_readpages()
returns.

I saw your patch http://marc.info/?l=linux-mm&m=118352909826277&w=2
that modifies e.g. mm/memory.c and not the NFS layer.

Have you proposed a patch against the NFS layer?


I was wondering if instead of modifying do_no_page() and Co., should
not we make nfs_readpage() be DMA-like?
(No possible regression for most of the page I/O-s.)
I.e. it should be the responsibility of a file system to make sure it
supports instruction pages correctly. The base kernel should provide
such file systems with an architecture dependent macro...



IMHO, for example, race in cooy-on-write  (was fixed by Tony Luck) has to be
fixed by MemoryManagement layer.


I can agree.
Note that it does not have much performance impact from our point
of view because there are not too many COW pages with executable code...


And only a race in do_no_page() seems to be able to be fixed by FS layer.


If it were do_no_page() that would fix the problem, then it should know
where the page comes from in order not to flush the I-cache in vain.
do_no_page() just calls vma->vm_ops->nopage() that goes down to
fs_op->readpage() / fs_op->readpages().

On the other hand, these latter routines do not know why they fetch
the page(s). Note that a page can be mapped more than once, with
different permission bits.
As a consequence, these routines will flush the I-cache in vain in
most of the cases.

I prefer a performance impact on some non-DMA-based file systems,
and adding nothing to the path for the majority of the cases.


BTW, can we know whether a page is filled by DMA or not  ?


Well, the file systems based on block devices use DMA transfer
(with the exception of using bounce buffers, in that case it is the
responsibility of the bounce buffering layer to flush the I-cache).

Network based file systems require revision and update...

Note that only the processors with separate L2I require
attention, the L1I is guaranteed to be flushed when you change
the corresponding TLB entry.

The base kernel should provide a macro / service (that cares for the
processor model) for the file systems.

Thanks,

Zoltan Menyhart


Re: Fw: [PATCH] ia64: race flushing icache in do_no_page path

2007-07-04 Thread Zoltan Menyhart

Could you please confirm that I understand correctly what is in the:

Dual-Core Update to the Intel Itanium 2 Processor Reference Manual...

"2.3.3.2 L2 Caches
...
Any coherence request to identify whether a cache line is in the processor
will invalidate that line from the L2I cache."

This makes sure that DMA invalidates the L2I cache.

"2.7.4 Instruction Cache Coherence Optimization
Coherence requests of the L1I and L2I caches will invalidate the line if
it is in the cache. Montecito allows instruction requests on the system
interface to be filtered such that they will not initiate coherence
requests of the L1I and L2I caches. This will allow instructions to be
cached at the L1I and L2I levels across multiple processors in a coherent
domain. This optimization is enabled by default, but may be disabled by
PAL_SET_PROC_FEATURES bit 5 of the Montecito feature_set (18)."

Machines start up with bit 5 = 0, so reading instruction pages via
NFS has to flush them from the L2I.


I was wondering if instead of modifying do_no_page() and Co., should
not we make nfs_readpage() be DMA-like?
(No possible regression for most of the page I/O-s.)
I.e. it should be the responsibility of a file system to make sure it
supports instruction pages correctly. The base kernel should provide
such file systems with an architecture dependent macro...
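
As a sketch of the kind of macro I mean (the name and the CPU flag are
invented for illustration only):

	/* provided by the base kernel; empty on architectures that do not care */
	#ifdef CONFIG_IA64
	extern int cpu_has_split_l2i;		/* would be set at boot for Montecito */
	#define arch_flush_icache_after_fs_read(page)				\
		do {								\
			if (cpu_has_split_l2i) {				\
				unsigned long _a = (unsigned long) page_address(page); \
				flush_icache_range(_a, _a + PAGE_SIZE);		\
			}							\
		} while (0)
	#else
	#define arch_flush_icache_after_fs_read(page)	do { } while (0)
	#endif

nfs_readpage() / nfs_readpages() would call it once the data is in place,
before SetPageUptodate().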

Thanks,

Zoltan Menyhart


copy_one_pte()

2007-03-13 Thread Zoltan Menyhart

I had a look at copy_one_pte().
I cannot see any ioproc_update_page() call, not even for the COW pages.
Is it intentional?

We can live with a COW page for a considerable time.
How could the IO-PROC. know that a process-ID / user virtual address pair
refers to the same page?

The comment above ioproc_update_page() says it is called every time a PTE
is created / modified...

Thanks,

Zoltan Menyhart



Re: Mprotect needs arch hook for updated PTE settings

2005-03-16 Thread Zoltan Menyhart
An application should not change the protection of its _own_ text region
without knowing well the requirements of the given architecture.
I do not think we should add anything into the kernel for this case.
I did see /lib/ld-linux-ia64.so.* changing the protection of the text
segment of the _victim_ application, when it linked the library references.
ld-linux-ia64.so.* changes the protection for the whole text segment
(otherwise, as the protection is per VMA, it would result in a VMA
fragmentation).
The text segment can be huge. There is no reason to flush the whole text
segment every time ld-linux-ia64.so.* patches an instruction and
changes the protection.
I think the solution should consist of these two measures:
1. Let's say that if a VMA is "executable", then it remains "executable"
  forever, i.e. mprotect() keeps the PROT_EXEC bit.
  As a result, if a page is faulted in for this VMA in the meantime, the
  good old mechanism makes sure that the I-caches are flushed.
2. Let's modify ld-linux-<arch>.so.*: having patched an instruction, it
  should take the appropriate, architecture dependent measure, e.g. for
  ia64, it should issue an "fc" instruction (a sketch follows this list).
  (Who cares about a debugger? It should know what it is doing ;-).)
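
For ia64 that could look roughly like this (untested user-space sketch; the
serialization sequence mirrors what the kernel's flush_icache_range() does):

	/* flush one patched bundle from the caches and resync the I-fetch pipeline */
	static inline void flush_patched_bundle(void *bundle)
	{
		asm volatile ("fc %0;;sync.i;;srlz.i;;" : : "r" (bundle) : "memory");
	}
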
I think there is no need for any extra flushes.
Thanks,
Zoltan Menyhart