Re: CONFIG_HOLES_IN_ZONE and memory hot plug code on x86_64

2015-08-28 Thread Steffen Persvold





On 27/08/15 22:20, "yhlu.ker...@gmail.com on behalf of Yinghai Lu" wrote:

>On Fri, Jun 26, 2015 at 4:31 PM, Steffen Persvold  wrote:
>> We’ve encountered an issue in a special case where we have a sparse E820 map 
>> [1].
>>
>> Basically the memory hotplug code is causing a “kernel paging request” BUG 
>> [2].
>
>the trace does not look like hotplug path.
>
>>
>> By instrumenting the function register_mem_sect_under_node() in 
>> drivers/base/node.c we see that it is called two times with the same struct 
>> memory_block argument :
>>
>> [1.901463] register_mem_sect_under_node: start = 80, end = 8f, nid = 0
>> [1.908129] register_mem_sect_under_node: start = 80, end = 8f, nid = 1
>
>Can you post whole log with SRAT related info?

I can probably reproduce again and get full logs when I get run time on the 
system again, but here’s some output that we saved in our internal Jira case :

[0.00] NUMA: Initialized distance table, cnt=6
[0.00] NUMA: Node 0 [mem 0x-0x0009] + [mem 
0x0010-0xd7ff] -> [mem 0x-0xd7ff]
[0.00] NUMA: Node 0 [mem 0x-0xd7ff] + [mem 
0x1-0x427ff] -> [mem 0x-0x427ff]
[0.00] NODE_DATA(0) allocated [mem 0x407fe3000-0x407ff]
[0.00] NODE_DATA(1) allocated [mem 0x807fe3000-0x807ff]
[0.00] NODE_DATA(2) allocated [mem 0xc07fe3000-0xc07ff]
[0.00] NODE_DATA(3) allocated [mem 0x1007fe3000-0x1007ff]
[0.00] NODE_DATA(4) allocated [mem 0x1407fe3000-0x1407ff]
[0.00] NODE_DATA(5) allocated [mem 0x1807fdd000-0x1807ff9fff]
[0.00]  [ea00-ea00101f] PMD -> 
[8803f860-880407df] on node 0
[0.00]  [ea0010a0-ea00201f] PMD -> 
[8807f860-880807df] on node 1
[0.00]  [ea0020a0-ea00301f] PMD -> 
[880bf860-880c07df] on node 2
[0.00]  [ea0030a0-ea00401f] PMD -> 
[880ff860-881007df] on node 3
[0.00]  [ea0040a0-ea00501f] PMD -> 
[8813f860-881407df] on node 4
[0.00]  [ea0050a0-ea00601f] PMD -> 
[8817f7e0-8818075f] on node 5

If I remember correctly there was a mix of 4GB and 8GB DIMMs populated on this 
system. In addition the firmware reserved 512 MByte at the end of each memory 
controller's physical range (hence the reserved ranges in the e820 map).

Note: this was with 4.1.0 vanilla so it could be obsolete now with 4.2-rc. I 
have not yet tested with your latest patches that you and Tony discussed.


Cheers,
Steffen



CONFIG_HOLES_IN_ZONE and memory hot plug code on x86_64

2015-06-26 Thread Steffen Persvold
Hi,

We’ve encountered an issue in a special case where we have a sparse E820 map 
[1].

Basically the memory hotplug code is causing a “kernel paging request” BUG [2].

By instrumenting the function register_mem_sect_under_node() in 
drivers/base/node.c we see that it is called twice with the same struct 
memory_block argument:

[1.901463] register_mem_sect_under_node: start = 80, end = 8f, nid = 0
[1.908129] register_mem_sect_under_node: start = 80, end = 8f, nid = 1

The second call causes a paging request because the for loop in 
register_mem_sect_under_node() scans pfns:

for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {


and can’t find one that matches the input “nid” argument (1), which is natural 
enough because those sections do not belong to node1, but rather to node0. This 
results in the for loop entering a “hole” in the pfn range which isn’t mapped.

Now, the code appears to have been designed to handle this by checking whether 
the pfn really belongs to this node with the function get_nid_for_pfn() in the 
same file:

static int get_nid_for_pfn(unsigned long pfn)
{
	struct page *page;

	if (!pfn_valid_within(pfn))
		return -1;
	page = pfn_to_page(pfn);
	if (!page_initialized(page))
		return -1;
	return pfn_to_nid(pfn);
}

However, pfn_valid_within() (from include/linux/mmzone.h) never returns false 
here, because:

/*
 * If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we
 * need to check pfn validility within that MAX_ORDER_NR_PAGES block.
 * pfn_valid_within() should be used in this case; we optimise this away
 * when we have no holes within a MAX_ORDER_NR_PAGES block.
 */
#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn) pfn_valid(pfn)
#else
#define pfn_valid_within(pfn) (1)
#endif


CONFIG_HOLES_IN_ZONE cannot be selected on x86_64; it is present only on 
ia64 and mips.
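
To make the failure mode concrete, here is a simplified sketch of how the pieces 
above interact on x86_64, where pfn_valid_within() compiles to (1) (illustrative 
only, not the verbatim drivers/base/node.c code):

	/* simplified: look for a pfn in this memory section that sits on 'nid' */
	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
		int page_nid = get_nid_for_pfn(pfn);
		/*
		 * On x86_64 pfn_valid_within(pfn) is the constant 1, so
		 * get_nid_for_pfn() goes straight to pfn_to_page()/pfn_to_nid()
		 * and reads the struct page even for pfns inside a hole.
		 */
		if (page_nid != nid)
			continue;
		/* ... register the section under this node ... */
	}

With CONFIG_HOLES_IN_ZONE set, get_nid_for_pfn() would instead return -1 for the 
unmapped pfns before touching their struct page.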

Is there a specific reason why CONFIG_HOLES_IN_ZONE isn’t available on x86_64? 
I’ve added a patch to arch/x86/Kconfig [3] which solves this issue; however, I 
guess another approach would be to figure out why 
register_mem_sect_under_node() is called with a wrong struct memory_block for 
node1.

Any comments or suggestions are welcome.

PS: Even if we avoid the sparse e820 map, register_mem_sect_under_node() is 
still invoked twice with the same struct memory_block, once for node0 (which 
gets a match) and once for node1. However, when all the pfns are mapped, it 
goes through the range fine without a paging request.

Cheers,
--
Steffen Persvold
Chief Architect NumaChip, Numascale AS
Tel: +47 23 16 71 88  Fax: +47 23 16 71 80 Skype: spersvold

[1]

[0.00] e820: BIOS-provided physical RAM map:
[0.00] BIOS-e820: [mem 0x-0x00087fff] usable
[0.00] BIOS-e820: [mem 0x00088000-0x00089bff] reserved
[0.00] BIOS-e820: [mem 0x00089c00-0x0009ebff] usable
[0.00] BIOS-e820: [mem 0x0009ec00-0x0009] reserved
[0.00] BIOS-e820: [mem 0x000e84e0-0x000f] reserved
[0.00] BIOS-e820: [mem 0x0010-0xd7e5] usable
[0.00] BIOS-e820: [mem 0xd7e6e000-0xd7e6] type 9
[0.00] BIOS-e820: [mem 0xd7e7-0xd7e93fff] ACPI data
[0.00] BIOS-e820: [mem 0xd7e94000-0xd7eb] ACPI NVS
[0.00] BIOS-e820: [mem 0xd7ec-0xd7ed] reserved
[0.00] BIOS-e820: [mem 0xd7eed000-0xd7ff] reserved
[0.00] BIOS-e820: [mem 0xe000-0xefff] reserved
[0.00] BIOS-e820: [mem 0xffe0-0x] reserved
[0.00] BIOS-e820: [mem 0x0001-0x000407ff] usable
[0.00] BIOS-e820: [mem 0x00040800-0x000427ff] reserved
[0.00] BIOS-e820: [mem 0x00042800-0x000807ff] usable
[0.00] BIOS-e820: [mem 0x00080800-0x000827ff] reserved
[0.00] BIOS-e820: [mem 0x00082800-0x000c07ff] usable
[0.00] BIOS-e820: [mem 0x000c0800-0x000c27ff] reserved
[0.00] BIOS-e820: [mem 0x000c2800-0x001007ff] usable
[0.00] BIOS-e820: [mem 0x00100800-0x001027ff] reserved
[0.00] BIOS-e820: [mem 0x00102800-0x001407ff] usable
[0.00] BIOS-e820: [mem 0x00140800-0x001427ff] reserved
[0.00] BIOS-e820: [mem 0x00142800-0x001807ff] usable
[0.00] BIOS-e820: [mem 0x00180800-0x001827ff] reserved
[0.00] BIOS-e820: [mem 0x00fd-0x00ff] reserved
[0.00] BIOS-e820: [mem 0x3f00-0x3fff] reserved

[2]

[1.915002] BUG: unable to handle kernel paging request at ea0010200020
[1.922

RFC: Additions to APIC driver

2015-06-25 Thread Steffen Persvold
Hi,

We’re preparing our APIC driver (arch/x86/kernel/apic/apic_numachip.c) for 
next-gen hardware support, and in that process I have a question about what the 
cleanest approach would be.

Both current generation and next generation chips will share a lot of similar 
code, but some of the core functionality is slightly different (such as the 
address to which you communicate with the APIC ICR to send IPIs, how to derive 
APIC IDs etc.).

The way I see it, we have a few alternatives:

1) Create a new arch/x86/kernel/apic/apic_numachip2.c (and a corresponding entry 
in the Makefile) which has a new “struct apic” with function pointers to the 
next-gen specific code. The new APIC driver would still only need 
CONFIG_X86_NUMACHIP to be compiled.

2) Modify the existing apic_numachip.c to recognise the different HW 
generations (trivial) and use function pointers to differentiate the IPI send 
calls (among other things), but use the *same* “struct apic” for both (the 
function pointers referenced in “struct apic” would need a new indirection 
level to differentiate between hardware revs).

3) Have two different “struct apic” entries in the existing apic_numachip.c 
source file, with separate oem_madt check functions etc. This would only be 
marginally different than 1) as far as implementation and code duplication 
goes, but it would be contained to one C source file and object file (silly 
question, maybe: would the apic_driver enumeration even work if it’s all in the 
same object file?)

Any insight into this from the great minds behind this would be highly 
appreciated.
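
For what it’s worth, a rough sketch of what alternative 2 could look like (all 
names here are made up for illustration; this is not actual apic_numachip.c code):

	/* hypothetical per-generation ops table, selected once at probe/MADT-check time */
	struct numachip_ops {
		void (*send_ipi_one)(int apicid, int vector);
	};

	static void numachip1_send_ipi_one(int apicid, int vector)
	{
		/* gen1: write to the current-generation ICR address */
	}

	static void numachip2_send_ipi_one(int apicid, int vector)
	{
		/* gen2: write to the next-generation ICR address */
	}

	static const struct numachip_ops numachip1_ops = { .send_ipi_one = numachip1_send_ipi_one };
	static const struct numachip_ops numachip2_ops = { .send_ipi_one = numachip2_send_ipi_one };
	static const struct numachip_ops *nc_ops;	/* set to &numachip1_ops or &numachip2_ops */

	static void numachip_send_IPI_one(int apicid, int vector)
	{
		/* the shared "struct apic" callback just forwards to the selected ops */
		nc_ops->send_ipi_one(apicid, vector);
	}

That would keep a single “struct apic” and a single Kconfig entry, at the cost of 
one extra indirection on the IPI path.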

Kind regards,
--
Steffen Persvold
Chief Architect NumaChip, Numascale AS
Tel: +47 23 16 71 88  Fax: +47 23 16 71 80 Skype: spersvold

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 2/5] x86/PCI: Support additional MMIO range capabilities

2014-04-29 Thread Steffen Persvold
On 29 Apr 2014, at 3:20, Borislav Petkov wrote:

> On Tue, Apr 29, 2014 at 09:33:09AM +0200, Andreas Herrmann wrote:
>> I am sure, it's because some server systems had MMIO ECS access not
>> enabled in BIOS. I can't remember which systems were affected.
> 
> Ok, now AMD people: what's the story with IO ECS, can we assume that on
> everything after F10h, BIOS has a sensible MCFG and we can limit this to
> F10h only? I like Bjorn's idea but we need to make sure a working MCFG
> is ubiquitous.
> 
> Which begs the real question: Suravee, why are you even touching IO ECS
> provided F15h and later have a MCFG? Or, do they?
> 

Our experience with this is that Fam10h and later have a very well working MCFG 
setup; earlier generations, not so much (hence IO ECS was needed).

Cheers,
Steffen

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, amd, mce: Prevent potential cpu-online oops

2013-04-09 Thread Steffen Persvold

On 4/9/2013 12:24 PM, Borislav Petkov wrote:

> On Tue, Apr 09, 2013 at 11:45:44AM +0200, Steffen Persvold wrote:
>> Hmm, yes of course. This of course breaks on our slave servers when
>> the shared mechanism doesn't work properly (i.e NB not visible). Then
>> all cores gets individual kobjects and there can be discrepancies
>> between what the hardware is programmed to and what is reflected in
>> /sys on some cores..
>
> Hold on, are you saying you have cores with an invisible NB? How does
> that even work? Or is it only invisible to sw?


only invisible to the kernel because the multi-pci-domains isn't working 
pre 3.9 on our architecture.


cheers,
Steffen
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, amd, mce: Prevent potential cpu-online oops

2013-04-09 Thread Steffen Persvold

On 4/9/2013 11:38 AM, Borislav Petkov wrote:

> On Tue, Apr 09, 2013 at 11:25:16AM +0200, Steffen Persvold wrote:
>> Why not let all cores just create their individual kobject and skip
>> this "shared" nb->bank4 concept ? Any disadvantage to that (apart from
>> the obvious storage bloat?).
>
> Well, bank4 is shared across cores on the northbridge in *hardware*.


Well, yes I was aware of that :)


> So it is only logical to represent the hardware layout correctly in
> software.
>
> Also, if you want to configure any settings over one core's sysfs nodes,
> you want those to be visible across all cores automagically:


Hmm, yes of course. This of course breaks on our slave servers when the 
shared mechanism doesn't work properly (i.e NB not visible). Then all 
cores gets individual kobjects and there can be discrepancies between 
what the hardware is programmed to and what is reflected in /sys on some 
cores..


Ok, we go with our first approach to not create MC4 at all if NB isn't 
visible.


We'll redo the patch against the tip:x86/ras branch.

Cheers,
Steffen

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, amd, mce: Prevent potential cpu-online oops

2013-04-09 Thread Steffen Persvold

On 4/4/2013 9:07 PM, Borislav Petkov wrote:

> On Thu, Apr 04, 2013 at 08:05:46PM +0200, Steffen Persvold wrote:
>> It made more sense (to me) to skip the creation of MC4 all together
>> if you can't find the matching northbridge since you can't reliably
>> do the dec_and_test() reference counting on the shared bank when you
>> don't have the common NB struct for all the shared cores.
>>
>> Or am I just smoking the wrong stuff ?
>
> No, actually *this* explanation should've been in the commit message.
> You numascale people do crazy things with the hardware :) so explaining
> yourself more verbosely is an absolute must if anyone is to understand
> why you're changing the code.



Boris,

A question came up. Why have this "shared" bank concept for the kobjects 
at all ? What's the advantage ? Before our patch, when running on our 
architecture but without pci domains for "slave" servers, everything was 
working fine except the de-allocation oops due to the NULL pointer when 
offlining cores.


Why not let all cores just create their individual kobject and skip this 
"shared" nb->bank4 concept ? Any disadvantage to that (apart from the 
obvious storage bloat?).


Cheers,
Steffen


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, amd, mce: Prevent potential cpu-online oops

2013-04-04 Thread Steffen Persvold

On 4/4/2013 9:07 PM, Borislav Petkov wrote:

> On Thu, Apr 04, 2013 at 08:05:46PM +0200, Steffen Persvold wrote:
>> It made more sense (to me) to skip the creation of MC4 all together
>> if you can't find the matching northbridge since you can't reliably
>> do the dec_and_test() reference counting on the shared bank when you
>> don't have the common NB struct for all the shared cores.
>>
>> Or am I just smoking the wrong stuff ?
>
> No, actually *this* explanation should've been in the commit message.
> You numascale people do crazy things with the hardware :) so explaining
> yourself more verbosely is an absolute must if anyone is to understand
> why you're changing the code.


Ok :)



> So please write a detailed commit message why you need this change,
> don't be afraid to talk about the big picture.


Will do.



> Also, I'm guessing this is urgent stuff and it needs to go into 3.9?
> Yes, no? If yes, this patch should probably be tagged for stable.


Yes. We found the issue on -stable at first (3.8.2 iirc) because it 
doesn't have the multi-domain support we needed (which is added in 3.9).




> Also, please redo this patch against tip:x86/ras which already has
> patches touching mce_amd.c.


Ok.



> Oh, and lastly, needless to say, it needs to be tested on a "normal",
> i.e. !numascale AMD multinode box, in case you haven't done so yet. :-)



It has been tested on "normal" platforms and NumaConnect platforms 
(Fam10h and Fam15h AMD processors, SCM and MCM versions).


Cheers,
Steffen
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86, amd, mce: Prevent potential cpu-online oops

2013-04-04 Thread Steffen Persvold
On 4/4/2013 6:13 PM, Borislav Petkov wrote:
> On Thu, Apr 04, 2013 at 11:52:00PM +0800, Daniel J Blueman wrote:
>> On platforms where all Northbridges may not be visible (due to routing, eg on
>> NumaConnect systems), prevent oopsing due to stale pointer access when
>> offlining cores.
>>
>> Signed-off-by: Steffen Persvold 
>> Signed-off-by: Daniel J Blueman 
> 
> Huh, what's up?
> 
> This one is almost reverting 21c5e50e15b1a which you wrote in the first
> place. What's happening? What stale pointer access, where? We have the
> if (nb ..) guards there.
> 
> This commit message needs a *lot* more explanation about what's going
> on and why we're reverting 21c5e50e15b1a. And why the special handling
> for shared banks? I presume you offline some of the cores and there's a
> dangling pointer but again, there are the nb validity guards...
> 
> /me is genuinely confused.
> 

You get oopses when offlining cores if there's no NB struct for the shared 
MC4 bank. In threshold_remove_bank(), there's no "if (!nb)" guard:

	if (shared_bank[bank]) {
		if (!atomic_dec_and_test(&b->cpus)) {
			__threshold_remove_blocks(b);
			per_cpu(threshold_banks, cpu)[bank] = NULL;
			return;
		} else {
			/*
			 * the last CPU on this node using the shared bank is
			 * going away, remove that bank now.
			 */
			nb = node_to_amd_nb(amd_get_nb_id(cpu));
			nb->bank4 = NULL;
		}
	}


nb->bank4 = NULL will oops, since nb is NULL.
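
(For reference, the kind of guard being discussed would look roughly like this; 
illustrative only, not the actual patch:

	nb = node_to_amd_nb(amd_get_nb_id(cpu));
	if (nb)				/* NB may not be visible on our slave servers */
		nb->bank4 = NULL;

but as explained below, not creating the MC4 bank at all seemed cleaner.)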

It made more sense (to me) to skip the creation of MC4 all together if you 
can't find the matching northbridge since you can't reliably do the 
dec_and_test() reference counting on the shared bank when you don't have the 
common NB struct for all the shared cores.

Or am I just smoking the wrong stuff ?

Cheers,
Steffen



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 RESEND] Add NumaChip remote PCI support

2012-11-30 Thread Steffen Persvold

Hi Bjorn,

On 11/30/2012 17:45, Bjorn Helgaas wrote:

> On Thu, Nov 29, 2012 at 10:28 PM, Daniel J Blueman
[]
>> We could expose pci_dev_base via struct x86_init_pci; the extra complexity
>> and performance tradeoff may not be worth it for a single case perhaps?


> Oh, right, I forgot that you can't decide this at build-time.  This is
> PCI config access, which is not a performance path, so I'm not really
> concerned about it from that angle, but you make a good point about
> the complexity.
>
> The reason I'm interested in this is because MMCONFIG is a generic
> PCIe feature but is currently done via several arch-specific
> implementations, so I'm starting to think about how we can make parts
> of it more generic.  From that perspective, it's nicer to parameterize
> an existing implementation than to clone it because it makes
> refactoring opportunities more obvious.
>
> Backing up a bit, I'm curious about exactly why you need to check for
> the limit to begin with.  The comment says "Ensure AMD Northbridges
> don't decode reads to other devices," but that doesn't seem strictly
> accurate.  You're not changing anything in the hardware to prevent it
> from *decoding* a read, so it seems like you're actually just
> preventing the read in the first place.
>
> What happens without the limit check?  Do you get a response timeout
> and a machine check?  Read from the wrong device?


The latter. I'm not sure how familiar you are with how pci config reads 
are decoded and handled on coherent hypertransport fabrics; The way it 
works *within* one coherent HT fabric is that the CPU will redirect all 
config space access above a configured max HT node (a setting in the AMD 
northbridge) to a specific I/O link (non-coherent link) which usually 
links up with a "southbridge" device that responds with a target abort 
(non-existing device).


However, this only works when a CPU core is accessing local HT devices. 
In our architecture, we "glue" together multiple HT fabrics and when a 
CPU core sends a pci config space request (mmconfig) to a remote machine 
(via our hardware) this re-direction is not applied anymore. The result 
is that when a mmconfig read comes in to a coherent HT device on bus00 
which is non-existent, one of the other HT nodes on that remote node 
will respond to the read, leading to "phantom" devices (i.e lspci will 
show more HT northbridges than what's really physically present) *or* 
worst case scenario will be that the transaction hangs (alternatively 
times out, leading to MCE and other bad things).


This is why we're checking accesses to bus0, device24-31 and returning a 
"fake" target abort scenario if the access was to a non-existing HT 
device. In other words, we're doing in software what a "normal" HT based 
platform would do in hardware.
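
As a rough illustration of that software check (simplified; not the actual 
numachip PCI code, and the helper name is invented):

	/* Illustrative: treat config reads to non-existent HT nodes on bus 0 as aborted */
	static bool ht_node_present(unsigned int bus, unsigned int devfn, unsigned int last_node)
	{
		/* devices 0x18-0x1f (24-31) on bus 0 are the HT/northbridge nodes */
		if (bus == 0 && PCI_SLOT(devfn) >= 0x18)
			return PCI_SLOT(devfn) - 0x18 <= last_node;
		return true;
	}

In the config read path, a failed check would then return all-ones data, i.e. 
behave as if the device had aborted the access, instead of letting the read go 
out to the remote fabric.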




> As far as I can tell, you still describe your MMCONFIG area with an
> MCFG table (since you use pci_mmconfig_lookup() to find the region).
> That table only includes the starting and ending bus numbers, so the
> assumption is that the MMCONFIG space is valid for every possible
> device on those buses.  So it seems like your system is not really
> compatible with the spec here.
>
> Because the MCFG table can't describe finer granularity than start/end
> bus numbers, we manage MMCONFIG regions as (segment, start_bus,
> end_bus, address) tuples.  Maybe if we tracked it with slightly finer
> granularity, e.g., (segment, start_bus, end_bus, end_bus_device,
> address), you could have some sort of MCFG-parsing quirk that reduces
> the size of the MMCONFIG region you register for bus 0.
>
> Just brainstorming here; it's not obvious to me yet what the best solution is.
>
> Bjorn



Kind regards,
--
Steffen Persvold, Chief Architect NumaChip
Numascale AS - www.numascale.com
Tel: +47 92 49 25 54 Skype: spersvold
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] Fix AMD Northbridge-ID contiguity assumptions

2012-10-12 Thread Steffen Persvold

On 10/12/2012 11:33, Borislav Petkov wrote:

On Thu, Oct 04, 2012 at 03:18:02PM +0200, Borislav Petkov wrote:

On Wed, Oct 03, 2012 at 09:21:14PM +0800, Daniel J Blueman wrote:

The AMD Northbridge initialisation code and EDAC assume the Northbridge IDs
are contiguous, which no longer holds on federated systems with multiple
HyperTransport fabrics and multiple PCI domains.

Address this assumption by searching the Northbridge ID array, rather than
directly indexing it, using the upper bits for the PCI domain.

v2: Fix Northbridge entry initialisation

Tested on a single-socket system and 3-server federated system.

Signed-off-by: Daniel J Blueman 
---
  arch/x86/include/asm/amd_nb.h |   23 +--
  arch/x86/kernel/amd_nb.c  |   16 +---
  drivers/edac/amd64_edac.c |   18 +-
  drivers/edac/amd64_edac.h |6 --


Ok,

I've been meaning to clean up that amd_nb.c code which iterates over all
PCI devices on the system just so it can count the NBs and then do it
again in order to do the ->misc and ->link assignment.

So below is what I've come up with and it builds but it is completely
untested and I might be completely off, for all I know.

The basic idea, though, is to have the first 8 NB descriptors in an
array of 8 and use that for a fast lookup on all those single-board
machines where the number of the northbridges is the number of physical
processors on the board (or x2, if a MCM).

Then, there's a linked list of all further NB descriptors which should
work in your case of confederated systems.

Btw, I've also reused your get_node_id function and the edac changes are
still pending but they should be trivial once this new approach pans
out.




Hi Boris,

This patch looks very clean and should serve our purpose as well (I'll 
double check with Daniel).


Regarding the size of the "node" variable, you asked before. The 
theoretical maximum number of AMD NBs we can have in a confederated 
NumaConnect system _today_ is 8*4096 (8 NBs per system, 4096 systems) so 
technically this could fit into a u16 instead of a u32 (you'll have to 
shift left by 3 instead of 8).


However, to allow some flexibility I think a u32 is better and I think 
we can live with those two extra bytes per struct member, or ?
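
(Purely to illustrate the arithmetic: with at most 8 NBs per board the per-board 
index needs 3 bits, so a packed id along the lines of

	u16 node = (pci_domain << 3) | local_nb_id;	/* 4096 domains x 8 NBs = 2^15 values */

would fit in a u16, whereas keeping the domain shifted left by 8 in a u32 leaves 
room for up to 256 NBs per domain. The field names here are hypothetical.)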


Cheers,
--
Steffen Persvold, Chief Architect NumaChip
Numascale AS - www.numascale.com
Tel: +47 92 49 25 54 Skype: spersvold
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: DMA memory limitation?

2001-07-06 Thread Steffen Persvold

> > GFP_DMA is ISA dma reachable, Forget the IA64, their setup is weird and 
> > should best be ignored until 2.5 as and when they sort it out.

Really ? I don't think I can ignore IA64, there are people who ask for it

> > > bounce buffers are needed. On Alpha GFP_DMA is not limited at all (I think). Correct me if
> >
> > Alpha has various IOMMU facilities
> >
> > > I'm wrong, but I really think there should be a general way of allocating memory that is
> > > 32bit addressable (something like GFP_32BIT?) so you don't need a lot of #ifdef's in your
> > > code.
> > No ifdefs are needed
> >
> > GFP_DMA - ISA dma reachable
> > pci_alloc_* and friends - PCI usable memory
> 
> pci_alloc_* is designed to support ISA.
> 
> Pass pci_dev==NULL to pci_alloc_* for ISA devices, and it allocs GFP_DMA
> for you.
> 

Sure, but the IA64 platforms that are out now don't have an IOMMU, so bounce buffers are
used if you don't specify GFP_DMA in your get_free_page.

Now let's say you have a driver with a page allocator. Eventually you want to make some of
the allocated pages available to a 32bit PCI device. These pages have to be consistent (i.e.
the driver doesn't have to wait for a PCI flush for the data to be valid, sort of like an
ethernet ring buffer). I could use the pci_alloc_consistent() function (pci_alloc_consistent()
allocates a buffer with GFP_DMA on IA64), but since I already have the pages, I have to use
pci_map_single (or pci_map_sg). Inside pci_map_single on IA64 something called swiotlb buffers
(bounce buffers) are used if the device can't support 64bit addressing and the address of the
memory to map is above the 4G limit. The swiotlb buffers are below the 4G limit and therefore
reachable by any PCI device. The problem with these buffers is that the content is not copied
back to the original location until you do a pci_sync_* or a pci_unmap_* (they are not
consistent), and they are a limited resource (allocated at boot time). My solution for now was
to use:

#if defined(__ia64__)
  int flag = GFP_DMA;
#else
  int flag = 0;
#endif

Maybe IA64 could implement GFP_HIGHMEM (as on i386) so that if no flags were used you were
guaranteed to get 32bit memory ???
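
For what it's worth, the way that flag ends up being used looks roughly like this
(2.4-era PCI DMA API, simplified; pdev stands for the device's struct pci_dev):

	/* allocate a page that a 32-bit-only PCI device can reach on IA64 (no IOMMU) */
	unsigned long buf = __get_free_page(flag);	/* flag == GFP_DMA on IA64, 0 elsewhere */
	dma_addr_t bus_addr = pci_map_single(pdev, (void *)buf,
					     PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
	/*
	 * Because the page is already below 4G, pci_map_single() doesn't have to
	 * fall back to the swiotlb bounce buffers described above, so the data at
	 * buf is directly visible to the device without a pci_sync_*()/pci_unmap_*()
	 * copy-back.
	 */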


Regards
-- 
  Steffen Persvold   Systems Engineer
  Email : mailto:[EMAIL PROTECTED] Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50  Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51  N-0621 Oslo, Norway
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: DMA memory limitation?

2001-07-06 Thread Steffen Persvold

Helge Hafting wrote:
> 
> Vasu Varma P V wrote:
> >
> > Hi,
> >
> > Is there any limitation on DMA memory we can allocate using
> > kmalloc(size, GFP_DMA)? I am not able to acquire more than
> > 14MB of the mem using this on my PCI SMP box with 256MB ram.
> > I think there is restriction on ISA boards of 16MB.
> > Can we increase it ?
> 
> You can allocate a lot more memory for your pci activities.
> No problem there.  Just drop the "GFP_DMA" and you'll get
> up to 1G or so.
> 
> You shouldn't use GFP_DMA because PCI cards don't need that.
> Only ISA cards needs GFP_DMA because they can't use more
> than 16M.  So obviously GFP_DMA is limited to
> 16M because it is really ISA_DMA.
> 
> PCI don't need such special tricks, so don't use GFP_DMA!
> Your PCI cards is able to DMA into any memory, including
> the non-GFP_DMA memory.
> 
> > but we have a macro in include/asm-i386/dma.h,
> > MAX_DMA_ADDRESS  (PAGE_OFFSET+0x100).
> >
> > if i change it to a higher value, i am able to get more dma
> > memory. Is there any way i can change this without compiling
> > the kernel?
> >
> No matter what you do, DON'T change that.  Yeah, you'll get
> a bigger GFP_DMA pool, but that'll break each and every
> ISA card that tries to allocate GFP_DMA memory.  You
> achieve exactly the same effect for your PCI card by ditching
> the GFP_DMA parameter, but then you achieve it without breaking
> ISA cards.
> 
A problem arises on 64 bit platforms (such as IA64) if your PCI card is only 32bit (can
address the first 4G) and you don't want to use bounce buffers. If you use GFP_DMA on
IA64 you are ensured that the memory you get is below 4G and not 16M as on i386, hence no
bounce buffers are needed. On Alpha GFP_DMA is not limited at all (I think). Correct me if
I'm wrong, but I really think there should be a general way of allocating memory that is
32bit addressable (something like GFP_32BIT?) so you don't need a lot of #ifdef's in your
code.

Regards,
-- 
  Steffen Persvold   Systems Engineer
  Email : mailto:[EMAIL PROTECTED] Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50  Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51  N-0621 Oslo, Norway
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: DMA memory limitation?

2001-07-06 Thread Steffen Persvold

Helge Hafting wrote:
 
 Vasu Varma P V wrote:
 
  Hi,
 
  Is there any limitation on DMA memory we can allocate using
  kmalloc(size, GFP_DMA)? I am not able to acquire more than
  14MB of the mem using this on my PCI SMP box with 256MB ram.
  I think there is restriction on ISA boards of 16MB.
  Can we increase it ?
 
 You can allocate a lot more memory for your pci activities.
 No problem there.  Just drop the GFP_DMA and you'll get
 up to 1G or so.
 
 You shouldn't use GFP_DMA because PCI cards don't need that.
 Only ISA cards needs GFP_DMA because they can't use more
 than 16M.  So obviously GFP_DMA is limited to
 16M because it is really ISA_DMA.
 
 PCI don't need such special tricks, so don't use GFP_DMA!
 Your PCI cards is able to DMA into any memory, including
 the non-GFP_DMA memory.
 
  but we have a macro in include/asm-i386/dma.h,
  MAX_DMA_ADDRESS  (PAGE_OFFSET+0x100).
 
  if i change it to a higher value, i am able to get more dma
  memory. Is there any way i can change this without compiling
  the kernel?
 
 No matter what you do, DON'T change that.  Yeah, you'll get
 a bigger GFP_DMA pool, but that'll break each and every
 ISA card that tries to allocate GFP_DMA memory.  You
 achieve exactly the same effect for your PCI card by ditching
 the GFP_DMA parameter, but then you achieve it without breaking
 ISA cards.
 
A problem arises on 64 bit platforms (such as IA64) if your PCI card is only 32bit (can
address the first 4G) and you don't wan't to use bounce buffers. If you use GFP_DMA on
IA64 you are ensured that the memory you get is below 4G and not 16M as on i386, hence 
no
bounce buffers are needed. On Alpha GFP_DMA is not limited at all (I think). Correct 
me if
I'm wrong, but I really think there should be a general way of allocating memory that 
is
32bit addressable (something like GFP_32BIT?) so you don't need a lot of #ifdef's in 
your
code.

Regards,
-- 
  Steffen Persvold   Systems Engineer
  Email : mailto:[EMAIL PROTECTED] Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50  Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51  N-0621 Oslo, Norway
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: DMA memory limitation?

2001-07-06 Thread Steffen Persvold

> >  GFP_DMA is ISA dma reachable, Forget the IA64, their setup is weird and
> >  should best be ignored until 2.5 as and when they sort it out.

Really ? I don't think I can ignore IA64, there are people who ask for it

> > >  bounce buffers are needed. On Alpha GFP_DMA is not limited at all (I think). Correct me if
> >
> >  Alpha has various IOMMU facilities
> >
> > >  I'm wrong, but I really think there should be a general way of allocating memory that is
> > >  32bit addressable (something like GFP_32BIT?) so you don't need a lot of #ifdef's in your
> > >  code.
> >  No ifdefs are needed
> >
> >  GFP_DMA - ISA dma reachable
> >  pci_alloc_* and friends - PCI usable memory
>
> pci_alloc_* is designed to support ISA.
>
> Pass pci_dev==NULL to pci_alloc_* for ISA devices, and it allocs GFP_DMA
> for you.
>

Sure, but the IA64 platforms that are out now don't have an IOMMU, so bounce buffers
are used if you don't specify GFP_DMA in your get_free_page.

Now let's say you have a driver with a page allocator. Eventually you want to make some
of the allocated pages available to a 32bit PCI device. These pages have to be consistent
(i.e. the driver doesn't have to wait for a PCI flush for the data to be valid, sort of
like an ethernet ring buffer). I could use the pci_alloc_consistent() function
(pci_alloc_consistent() allocates a buffer with GFP_DMA on IA64), but since I already have
the pages, I have to use pci_map_single (or pci_map_sg). Inside pci_map_single on IA64
something called swiotlb buffers (bounce buffers) are used if the device can't support
64bit addressing and the address of the memory to map is above the 4G limit. The swiotlb
buffers are below the 4G limit and therefore reachable by any PCI device. The problem
with these buffers is that their contents are not copied back to the original location
until you do a pci_sync_* or a pci_unmap_* (they are not consistent), and they are a
limited resource (allocated at boot time). My solution for now was to use :

#if defined(__ia64__)
  int flag = GFP_DMA;
#else
  int flag = 0;
#endif
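
For reference, the pci_map_single()/swiotlb behaviour described above could be sketched
roughly as below; this is only an illustration, and pdev, buf, len and my_start_dma() are
hypothetical placeholders:

/* rough sketch of the pci_map_* (streaming) path for a 32bit-only device */
dma_addr_t handle;

/* on IA64 this may pick a swiotlb bounce buffer below 4G if buf sits
 * above the 4G limit and the device can only address 32 bits */
handle = pci_map_single(pdev, buf, len, PCI_DMA_FROMDEVICE);

my_start_dma(handle, len);      /* the device DMAs into the mapped (bounce) buffer */

/* the data only shows up in buf once you sync or unmap -- this is
 * exactly the "not consistent" behaviour described above */
pci_dma_sync_single(pdev, handle, len, PCI_DMA_FROMDEVICE);
pci_unmap_single(pdev, handle, len, PCI_DMA_FROMDEVICE);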

Maybe IA64 could implement GFP_HIGHMEM (as on i386) so that if no flags were used you
were guaranteed to get 32bit memory ???


Regards
-- 
  Steffen Persvold   Systems Engineer
  Email : mailto:[EMAIL PROTECTED] Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50  Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51  N-0621 Oslo, Norway
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Router problems with transparent proxy

2001-07-01 Thread Steffen Persvold

Hi,

I think I've triggered a bug in the ipchains/iptables part of the kernel. Here is the
story :

The server was an 866MHz PIII with 384 MByte of RAM running RH7.1 with a 2.4.5-ac21
kernel. It was used as a router/firewall with 2 netcards (not sure which type, but I
don't think that's important). Using this machine as a plain router was no problem at
all, and serving a class C net onto a 3 MBit line was just a walk in the park; the
machine was idle most of the time. Then we decided to set up transparent proxying and
used a pretty standard setup redirecting all port 80 accesses with ipchains to squid.
Things worked fine for a while (about 2 hrs) until we noticed that the machine got
extremely unresponsive on the console. A 'top' session showed us that the machine was
almost 100% in system time. If we disconnected some of the segments on the C net,
system time went down a bit. We rebooted the machine and noticed that the system time
started at zero and went slowly upwards until it reached 100% (after about 2 hrs), and
we just needed to reboot again. We then disabled the ipchains stuff, and now the server
is rock solid with a 'normal' proxy setup (and 100% idle almost all the time). Just for
the record : we also tried standard RH7.1 kernels (2.4.2-2 and 2.4.3) with the same
results.

Any ideas ? Anybody experienced similar behaviour ? It looks like a resource leak
somewhere in the IP filter code to me.

Regards,
-- 
  Steffen Persvold   Systems Engineer
  Email : mailto:[EMAIL PROTECTED] Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50  Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51  N-0621 Oslo, Norway
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Functionality of mmap, nopage and remap_page_range

2001-06-12 Thread Steffen Persvold

Hi kernel list readers,

I have a question about the functionality of mmap(), the vma->vm_ops functions and the
different vma->vm_flags. Is there any documentation that describes these methods and how
they should work (i.e. when should mmap() use remap_page_range() and when is the
vma->vm_ops->nopage function called)?

Any help appreciated.
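
For what it's worth, a minimal sketch of the two approaches (2.4-style interfaces;
my_phys_base and my_buffer are hypothetical, and the nopage prototype is written from
memory, so treat this as an illustration rather than a reference):

/* Path 1: map the whole region up front in the mmap() method with
 * remap_page_range() -- typical for physically contiguous or I/O memory. */
static int my_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long size = vma->vm_end - vma->vm_start;

        if (remap_page_range(vma->vm_start,
                             my_phys_base + (vma->vm_pgoff << PAGE_SHIFT),
                             size, vma->vm_page_prot))
                return -EAGAIN;
        return 0;
}

/* Path 2: map nothing up front; the kernel calls vma->vm_ops->nopage()
 * on each fault and we hand back one page at a time. */
static struct page *my_nopage(struct vm_area_struct *vma,
                              unsigned long address, int write_access)
{
        unsigned long offset = address - vma->vm_start
                               + (vma->vm_pgoff << PAGE_SHIFT);
        struct page *page = virt_to_page(my_buffer + offset);

        get_page(page);         /* take a reference for the new mapping */
        return page;
}
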
-- 
  Steffen Persvold   Systems Engineer
  Email : mailto:[EMAIL PROTECTED] Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50  Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51  N-0621 Oslo, Norway
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: temperature standard - global config option?

2001-06-09 Thread Steffen Persvold

"L. K." wrote:
> I haven't encountered any CPU with builtin temperature sensors.
> 
Eh, all Pentium class CPUs have a built-in sensor for core temperature (I believe Athlons
too). It's just the readout logic which sits outside, in the form of an A/D converter
connected to an I2C bus.

Regards,
-- 
  Steffen Persvold   Systems Engineer
  Email : mailto:[EMAIL PROTECTED] Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50  Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51  N-0621 Oslo, Norway
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Question regarding pci_alloc_consistent() and __get_free_pages

2001-06-01 Thread Steffen Persvold

Hi,

I have a question regarding the pci_alloc_consistent() function. Will this function
allocate pages that are physically contiguous ? I.e. if I call this function with a size
argument of 32KByte, will that be 8 consecutive pages in memory on the i386 architecture
(4 pages on Alpha) ? In general, will __get_free_pages(GFP_ATOMIC, order) always return
physically contiguous memory ?
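
For concreteness, a small sketch of the two allocations being compared (pdev is a
hypothetical struct pci_dev * here; both calls are expected to hand back physically
contiguous memory):

dma_addr_t dma_handle;
void *buf;
unsigned long pages;

/* 32KByte of consistent memory -- 8 pages on i386, 4 pages on Alpha */
buf = pci_alloc_consistent(pdev, 32 * 1024, &dma_handle);

/* the low-level route: order 3 gives 8 physically contiguous 4K pages on i386 */
pages = __get_free_pages(GFP_ATOMIC, get_order(32 * 1024));

free_pages(pages, get_order(32 * 1024));
pci_free_consistent(pdev, 32 * 1024, buf, dma_handle);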

All feedback appreciated,
-- 
  Steffen Persvold   Systems Engineer
  Email : mailto:[EMAIL PROTECTED] Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50  Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51  N-0621 Oslo, Norway
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: VIA/PDC/Athlon

2001-05-18 Thread Steffen Persvold

Pavel Roskin wrote:
> 
> Hello, Zilvinas!
> 
> There are utilities that work with PnP BIOS. They are included with
> pcmcia-cs (which is weird - it should be a separate package) and called
> "lspci" and "setpci". They depend on PnP BIOS support in the kernel
> (CONFIG_PNPBIOS).
> 
> Dumping your PnP BIOS configuration and checking whether it has changed
> after booting to Windows would be more reasonable than checking your PCI
> configuration (IMHO).

Ehm, "lspci" and "setpci" is part of the pci-utils package (at least on RedHat)
and is used to dump/modify PCI configuration space (/proc/bus/pci). If you know
how to use these tools to dump PNP bios, please tell us.

Regards
-- 
  Steffen Persvold   Systems Engineer
  Email : mailto:[EMAIL PROTECTED] Scali AS (http://www.scali.com)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Kernel crash using NFSv3 on 2.4.4

2001-04-30 Thread Steffen Persvold

Hi all,

I have compiled a stock 2.4.4 kernel and applied SGI's kdb patch v1.8. Most of
the time this runs just fine, but one time when I tried to copy a file from an
NFS server I got a kernel fault. Luckily it dropped right into the debugger and
I could do some backtracing (quite useful!) :

Unable to handle kernel paging request at virtual address 414478b1
 printing eip:
c012c826
*pde = 

Entering kdb (current=0xca07a000, pid 971) on processor 0 Oops: Oops
due to oops @ 0xc012c826
eax = 0x2000 ebx = 0xc15e4800 ecx = 0x edx = 0xc1447899
esi = 0x edi = 0xc14477a0 esp = 0xca07ba98 eip = 0xc012c826
ebp = 0xca07baa4 xss = 0x0018 xcs = 0x0010 eflags = 0x00010046
xds = 0xc1440018 xes = 0x0018 origeax = 0x &regs = 0xca07ba64
[0]kdb> bt
EBP   EIP Function(args)
0xca07baa4 0xc012c826 kmem_cache_alloc_batch+0x46 (0xc14477a0, 0x7, 0xcb965260)
   kernel .text 0xc010 0xc012c7e0 0xc012c864
0xca07bad0 0xc012ca8e kmalloc+0x82 (0x13c, 0x7, 0xca4ca040, 0x0)
   kernel .text 0xc010 0xc012ca0c 0xc012cb1c
0xca07baf4 0xc01fd254 alloc_skb+0x104 (0x100, 0x7)
   kernel .text 0xc010 0xc01fd150 0xc01fd31c
0xca07bb14 0xc01fc85c sock_alloc_send_skb+0x68 (0xca4ca040, 0xe7, 0x40,
0xca07bb44)
   kernel .text 0xc010 0xc01fc7f4 0xc01fc8f8
0xca07bb48 0xc020ff58 ip_build_xmit+0xe8 (0xca4ca040, 0xc022834c, 0xca07bbac,
0xc8, 0xca07bbc4)
   kernel .text 0xc010 0xc020fe70 0xc02101cc
0xca07bbd0 0xc022879c udp_sendmsg+0x344 (0xca4ca040, 0xca07bcc0, 0xac)
   kernel .text 0xc010 0xc0228458 0xc0228824
0xca07bbe8 0xc022e494 inet_sendmsg+0x40 (0xca1a2d08, 0xca07bcc0, 0xac,
0xca07bc18, 0xc000)
   kernel .text 0xc010 0xc022e454 0xc022e49c
0xca07bc2c 0xc01fa27e sock_sendmsg+0x7a (0xca1a2d08, 0xca07bcc0, 0xac)
   kernel .text 0xc010 0xc01fa204 0xc01fa2a0
0xca07bcdc 0xd089c9e4 [sunrpc]do_xprt_transmit+0x158 (0xca07bd70)
   sunrpc .text 0xd089a060 0xd089c88c 0xd089ccc0
0xca07bcf4 0xd089c87f [sunrpc]xprt_transmit+0xa3 (0xca07bd70)
   sunrpc .text 0xd089a060 0xd089c7dc 0xd089c88c
0xca07bd08 0xd089aa43 [sunrpc]call_transmit+0x43 (0xca07bd70)
[0]more>   
   sunrpc .text 0xd089a060 0xd089aa00 0xd089aa6c
0xca07bd38 0xd089e1de [sunrpc]__rpc_execute+0xa6 (0xca07bd70, 0x0)
   sunrpc .text 0xd089a060 0xd089e138 0xd089e3ec
0xca07bd4c 0xd089e448 [sunrpc]rpc_execute_Rsmp_cbcaa361+0x5c (0xca07bd70,
0xca07be2c)
   sunrpc .text 0xd089a060 0xd089e3ec 0xd089e464
0xca07bdf8 0xd089a493 [sunrpc]rpc_call_sync_Rsmp_1a543287+0x73 (0xcb6f09a0,
0xca07be34, 0x0, 0xca07a000)
   sunrpc .text 0xd089a060 0xd089a420 0xd089a4c0
0xca07beb0 0xd09524e4 [nfs]nfs3_proc_access+0x108 (0xca1fc600, 0x1, 0x0)
   nfs .text 0xd0948060 0xd09523dc 0xd0952534
0xca07bed8 0xd094f739 [nfs]nfs_permission+0x8d (0xca1fc600, 0x1)
   nfs .text 0xd0948060 0xd094f6ac 0xd094f7b0
0xca07bef4 0xc01407db permission+0x4b (0xca1fc600, 0x1)
   kernel .text 0xc010 0xc0140790 0xc0140828
0xca07bf28 0xc0141488 path_walk+0x898 (0xcfdb401b, 0xca07bf7c)
   kernel .text 0xc010 0xc0140bf0 0xc01414b0
0xca07bf58 0xc0141ba6 open_namei+0x8a (0xcfdb4000, 0x8001, 0x0, 0xca07bf7c)
   kernel .text 0xc010 0xc0141b1c 0xc0142100
0xca07bf98 0xc0134fda filp_open+0x3a (0xcfdb4000, 0x8000, 0x0)
   kernel .text 0xc010 0xc0134fa0 0xc0134ffc
0xca07bfbc 0xc01352fe sys_open+0x42 (0xbc18, 0x8000, 0x0, 0x2, 0xbc55)
   kernel .text 0xc010 0xc01352bc 0xc01353bc
   0xc0106ea7 system_call+0x33
   kernel .text 0xc010 0xc0106e74 0xc0106eac
[0]more>

-- 
 Steffen PersvoldSystems Engineer
 Email  : mailto:[EMAIL PROTECTED]Scali AS (http://www.scali.com)
 Norway : Tel  : (+47) 2262 8950 Olaf Helsets vei 6
  Fax  : (+47) 2262 8951 N-0621 Oslo, Norway

 USA: Tel  : (+1) 713 706 0544   10500 Richmond Avenue, Suite 190
 Houston, Texas 77042, USA
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: ServerWorks LE and MTRR

2001-04-29 Thread Steffen Persvold

[EMAIL PROTECTED] wrote:
> On Sun, 29 Apr 2001, Steffen Persvold wrote:
> 
> > I've learned it the hard way, I have two types : Compaq DL360 (rev 5) and a
> > Tyan S2510 (rev 6). On the compaq machine I constantly get data corruption on
> > the last double word (4 bytes) in a 64 byte PCI burst when I use write
> > combining on the CPU. On the Tyan however the transfer is always ok.
> >
> 
> Are you sure that is not due to board design differences?

No, I can't be 100% certain that the layout of the board isn't the reason, since
I haven't asked ServerWorks about this and it doesn't say anything in their
docs (yes, my company has the NDA, so I shouldn't go too much into detail here),
but if this were the case it would be totally wrong to disable write combining
on any LE chipset.

The test case that I have been using to trigger this is sort of special, because
we are using SCI shared memory adapters to write (with PIO) into remote nodes'
memory, and the bandwidth tends to get quite high (approx 170 MByte/sec on LE
with write combining). I've been able to run this case on 5 different
motherboards using the LE and HE-SL ServerWorks chipsets, but only two of them
are LE (the DL360 and the S2510). Everything works fine with write-combining on
every motherboard except the DL360 (which has rev 5).

One basic test case that I haven't tried could be to enable write-combining on
your PCI graphics adapter memory and see if the X display gets screwed up.

I will try to get some information from ServerWorks about this problem, but I'm
not sure if ServerWorks would be happy if I told you the answer (because of the
NDA).

Regards,
-- 
 Steffen PersvoldSystems Engineer
 Email  : mailto:[EMAIL PROTECTED]Scali AS (http://www.scali.com)
 Norway : Tel  : (+47) 2262 8950 Olaf Helsets vei 6
  Fax  : (+47) 2262 8951 N-0621 Oslo, Norway

 USA: Tel  : (+1) 713 706 0544   10500 Richmond Avenue, Suite 190
 Houston, Texas 77042, USA
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: ServerWorks LE and MTRR

2001-04-29 Thread Steffen Persvold

Gérard Roudier wrote:
> 
> On Sun, 29 Apr 2001, Steffen Persvold wrote:
> 
> > Hi all,
> >
> > I just compiled 2.4.4 and are running it on a Serverworks LE motherboard.
> > Whenever I try to add a write-combining region, it gets rejected. I took a peek
> > in the arch/i386/kernel/mtrr.c and found that this is just as expected with
> > v1.40 of the code. It is great that the mtrr code checks and prevents the user
> > from doing something that could eventually lead to data corruption. Using
> > write-combining on PCI acesses can lead to this on certain LE revisions but
> > _not_ all (only rev < 5). Therefore please consider my small patch to allow the
> > good ones to be able to use write-combining. I have several rev 06 and they are
> > working fine with this patch.
> 
> You wrote that 'only rev < 5' can lead to data corruption, but your patch
> seems to disallow use of write combining for rev 5 too.
> 
> Could you clarify?

Oops just a typo, it should be <= 5. The patch is correct.

> 
>   Gérard.
> 
> PS:
> From what hat did you get this information ? as it seems that ServerWorks
> require NDA for letting know technical information on their chipsets.
> 

I've learned it the hard way, I have two types : Compaq DL360 (rev 5) and a
Tyan S2510 (rev 6). On the compaq machine I constantly get data corruption on
the last double word (4 bytes) in a 64 byte PCI burst when I use write
combining on the CPU. On the Tyan however the transfer is always ok.

-- 
 Steffen PersvoldSystems Engineer
 Email  : mailto:[EMAIL PROTECTED]Scali AS (http://www.scali.com)
 Norway : Tel  : (+47) 2262 8950 Olaf Helsets vei 6
  Fax  : (+47) 2262 8951 N-0621 Oslo, Norway

 USA: Tel  : (+1) 713 706 0544   10500 Richmond Avenue, Suite 190
 Houston, Texas 77042, USA
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



ServerWorks LE and MTRR

2001-04-29 Thread Steffen Persvold

Hi all,

I just compiled 2.4.4 and am running it on a ServerWorks LE motherboard.
Whenever I try to add a write-combining region, it gets rejected. I took a peek
at arch/i386/kernel/mtrr.c and found that this is just as expected with
v1.40 of the code. It is great that the mtrr code checks and prevents the user
from doing something that could eventually lead to data corruption. Using
write-combining on PCI accesses can lead to this on certain LE revisions but
_not_ all (only rev < 5). Therefore please consider my small patch to allow the
good ones to use write-combining. I have several rev 06 boards and they are
working fine with this patch.

Best regards,
-- 
 Steffen PersvoldSystems Engineer
 Email  : mailto:[EMAIL PROTECTED]Scali AS (http://www.scali.com)
 Norway : Tel  : (+47) 2262 8950 Olaf Helsets vei 6
  Fax  : (+47) 2262 8951 N-0621 Oslo, Norway

 USA: Tel  : (+1) 713 706 0544   10500 Richmond Avenue, Suite 190
 Houston, Texas 77042, USA

diff -Nur linux/arch/i386/kernel/mtrr.c.~1~ linux/arch/i386/kernel/mtrr.c
--- linux/arch/i386/kernel/mtrr.c.~1~   Wed Apr 11 21:02:27 2001
+++ linux/arch/i386/kernel/mtrr.c   Sun Apr 29 10:18:06 2001
@@ -480,6 +480,7 @@
 {
 unsigned long config, dummy;
 struct pci_dev *dev = NULL;
+u8 rev;
 
/* ServerWorks LE chipsets have problems with  write-combining 
   Don't allow it and  leave room for other chipsets to be tagged */
@@ -489,7 +490,9 @@
 case PCI_VENDOR_ID_SERVERWORKS:
switch (dev->device) {
case PCI_DEVICE_ID_SERVERWORKS_LE:
-   return 0;
+   pci_read_config_byte(dev, PCI_CLASS_REVISION, &rev);
+   if (rev <= 5)
+   return 0;
break;
default:
break;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Kiobufs and userspace memory

2001-04-29 Thread Steffen Persvold

Hi all,

I'm writing a device driver for a shared memory adapter, for which I plan to
support DMA directly from userspace memory (zero copy). I have already
implemented a version which I think works, but I'm not sure if I get the IO
addresses calculated correctly. The case is as follows :

The userspace application has allocated some memory with malloc() and uses
write() on the device's /dev entry.

The driver's write() function looks something like this :

ssize_t my_write(struct file* file, const char* userbuf, size_t len, loff_t* poff)
{
  struct kiobuf * iobuf;
  struct my_sglist *sglist = NULL;
  size_t totlen = 0;
  int i, err;

  /* Pin user memory */
  err = alloc_kiovec(1, &iobuf);
  if (err)
    return err;

  err = map_user_kiobuf(WRITE, iobuf, (unsigned long) userbuf, len);
  if (err)
    goto out;

  /* Traverse the iobuf to get the IO address for each page,
   * building up the SG table for my DMA machine */

  sglist = kmalloc(sizeof(struct my_sglist) * iobuf->nr_pages, GFP_ATOMIC);
  if (!sglist) {
    err = -ENOMEM;
    goto out_unmap;
  }

  for (i = 0; i < iobuf->nr_pages; ++i)
  {
    /* iobuf->offset only applies to the first page; this also assumes
     * lowmem pages, since virt_to_bus() is not valid for highmem */
    struct page *page = iobuf->maplist[i];
    unsigned long off = (i == 0) ? iobuf->offset : 0;
    void *vaddr = page_address(page) + off;
    unsigned long ioaddr = virt_to_bus(vaddr);
    size_t chunk = PAGE_SIZE - off;

    if (chunk > len - totlen)
      chunk = len - totlen;

    sglist[i].start = ioaddr;
    sglist[i].len = chunk;

    totlen += chunk;
  }

  /* Start the synchronous DMA engine */
  my_start_dma(sglist);
  kfree(sglist);
  err = 0;

 out_unmap:
  /* Unpin user memory */
  unmap_kiobuf(iobuf);

 out:
  free_kiovec(1, &iobuf);
  return err ? err : (ssize_t) totlen;
}

Is this use of kiobufs sensible to you ? If not, what should I really be doing
in order to achieve zero copy DMA ?

I also have a question regarding a DMA read into userspace memory :

If the application didn't initialize the buffer with memset or anything, all the
pages map to the same page (the zero page), right ? So if a map_user_kiobuf()
call is made on this buffer, will it sort this out and map the pages to real
ones ?


Any response greatly appreciated,
-- 
 Steffen PersvoldSystems Engineer
 Email  : mailto:[EMAIL PROTECTED]Scali AS (http://www.scali.com)
 Norway : Tel  : (+47) 2262 8950 Olaf Helsets vei 6
  Fax  : (+47) 2262 8951 N-0621 Oslo, Norway

 USA: Tel  : (+1) 713 706 0544   10500 Richmond Avenue, Suite 190
 Houston, Texas 77042, USA
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Question regarding kernel threads and userlevel

2001-04-27 Thread Steffen Persvold

Hi linux-kernel,

I have a question regarding kernel threads : are kernel threads treated the same
as normal userlevel processes in terms of scheduling ?

In my test case I have a driver for a PCI card for which I want to control
access to its memory (prefetchable PCI space). Userlevel processes can mmap
this PCI memory and write directly to it (via the nopage technique). This is
also possible from the kernel thread, but to avoid thrashing and short bursts on
the PCI bus, I protect every access to the memory space with a spin lock (an
mmapped kernel memory page which the driver initializes). That means that if you
have an SMP system and two userlevel processes want to write to this memory,
one will have to wait for the other before doing the memcpy (yep, I'm using what
you could call PIO). This works great for two userlevel processes.

Now the reason for my question: if I also have a kernel thread wanting to
write to this memory space, it will also have to wait for the same lock (though
not mmapped, since we are already in kernel space and can access the lock page
directly). What happens is that if a userlevel process holds this lock and the
kernel thread gets scheduled and tries to get the same lock, it will deadlock,
because the userlevel process never gets back control and releases the lock
(kinda like when you take a spinlock at interrupt level on a lock which was
taken without spin_lock_irq). Is this because the kernel thread has higher
priority than the userlevel process (it has a nice level of -20) ?
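
For illustration, a rough sketch of the kernel-thread side of the loop described
above (lockp, pci_window, data and size are hypothetical; yielding inside the
wait loop is one way to give the userlevel lock holder a chance to run):

/* lockp points at the mmapped lock word shared with userspace */
while (test_and_set_bit(0, lockp)) {
        /* A plain busy-wait never gives the CPU away.  If the userlevel
         * process holding the lock is not running on another CPU right
         * now, it can never release the lock and this loop spins forever.
         * Letting the scheduler run avoids that: */
        if (current->need_resched)
                schedule();
}

memcpy_toio(pci_window, data, size);    /* the protected PIO access */
clear_bit(0, lockp);                    /* release the lock for the other side */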

Best regards,
-- 
 Steffen PersvoldSystems Engineer
 Email  : mailto:[EMAIL PROTECTED]Scali AS (http://www.scali.com)
 Norway : Tel  : (+47) 2262 8950 Olaf Helsets vei 6
  Fax  : (+47) 2262 8951 N-0621 Oslo, Norway

 USA: Tel  : (+1) 713 706 0544   10500 Richmond Avenue, Suite 190
 Houston, Texas 77042, USA
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: APIC errors ...

2001-04-19 Thread Steffen Persvold

Chris Wedgwood wrote:
> 
> On Wed, Apr 18, 2001 at 09:27:12PM -0500, Rico Tudor wrote:
> 
> Another problem area is ECC monitoring.  I'm still waiting for
> info from ServerWorks, and so is Dan Hollis.  Alexander Stohr has
> even submitted code to Jim Foster for approval, without evident
> effect.  I have 18GB of RAM divided among five ServerWorks boxes,
> so the matter is not academic.
> 
> Add environemt monitoring. One mf my play machines is a dell 2540,
> dual AC power, lots os fans and temperature sensing, I'd really like
> to be able to get this information from it (yeah, closed source Dell
> drivers are worth almost zero).
> 

This must be a Dell issue then, because I wrote an lm_sensors
(http://www.netroedge.com/~lm78/) driver for the ServerWorks OSB4 (SouthBridge) some
time ago and it has been merged into the PIIX4 driver. lm_sensors 2.5.5 and above should
have support for the ServerWorks System Management Bus. I have been running lm_sensors
2.5.5 on several mobos with ServerWorks chipsets of all kinds (LE, HE, HE-SL) and most
of them work with the PIIX4 driver (with OSB4 support). The only one I've had problems
with so far is the Compaq DL360, which seems to have disabled the SMB on the OSB4 and
instead uses another approach (proprietary). This could be the problem with the Dell
machines too (2450, 2550, 1550).

Best regards
-- 
 Steffen PersvoldSystems Engineer
 Email  : mailto:[EMAIL PROTECTED]Scali AS (http://www.scali.com)
 Norway : Pho  : (+47) 2262 8950 Olaf Helsets vei 6
  Fax  : (+47) 2262 8951 N-0621 Oslo, Norway

 USA: Pho  : (+1) 713 706 0544   10500 Richmond Ave, Suite 190
 Houston, Texas 77042, USA
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



2.4.3 and Alpha

2001-04-13 Thread Steffen Persvold

Hi,

Any particular reason why a stock 2.4.3 kernel doesn't have mm.h and
pgalloc.h in sync on Alpha ? This is what I get :

# make boot
gcc -D__KERNEL__ -I/usr/src/redhat/linux/include -Wall
-Wstrict-prototypes -O2 -fomit-frame-pointer -fno-strict-aliasing -pipe
-mno-fp-regs -ffixed-8 -mcpu=ev5 -Wa,-mev6   -c -o init/main.o
init/main.c
In file included from /usr/src/redhat/linux/include/linux/highmem.h:5,
 from /usr/src/redhat/linux/include/linux/pagemap.h:16,
 from /usr/src/redhat/linux/include/linux/locks.h:8,
 from /usr/src/redhat/linux/include/linux/raid/md.h:36,
 from init/main.c:24:
/usr/src/redhat/linux/include/asm/pgalloc.h:334: conflicting types for
`pte_alloc'
/usr/src/redhat/linux/include/linux/mm.h:399: previous declaration of
`pte_alloc'
/usr/src/redhat/linux/include/asm/pgalloc.h:352: conflicting types for
`pmd_alloc'
/usr/src/redhat/linux/include/linux/mm.h:412: previous declaration of
`pmd_alloc'
make: *** [init/main.o] Error 1


2.4.1 compiled fine, and as far as I can see, some changes have been made
to mm.h since then. I think these changes were followed up for i386, ppc,
s390 and sparc64 but not for the other architectures. Any plans for when
this will be done ?

Best regards,
-- 
  Steffen Persvold   Systems Engineer
  Email : mailto:[EMAIL PROTECTED] Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50  Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51  N-0621 Oslo, Norway
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


