Re: Linux 2.6.39-rc3
On Wednesday, April 13, 2011, Linus Torvalds wrote: > On Wednesday, April 13, 2011, H. Peter Anvin wrote: >> >> Yes. However, even if we *do* revert (and the time is running short on >> not reverting) I would like to understand this particular one, simply >> because I think it may very well be a problem that is manifesting itself >> in other ways on other systems. sorry, fingerfart. Anyway, I agree 100%. we definitely want to also understand the reason for things not working, even if we do revert.. Linus >> of complete b*llsh*t magic numbers in this > ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Linux 2.6.39-rc3
On Mon, Apr 18, 2011 at 11:59 AM, Jerome Glisse wrote: > On Mon, Apr 18, 2011 at 11:33 AM, Alex Deucher wrote: >> On Mon, Apr 18, 2011 at 11:29 AM, Jerome Glisse wrote: >>> On Mon, Apr 18, 2011 at 11:23 AM, Alex Deucher wrote: On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel wrote: > On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote: > >> If you want to go the printk way you can add printk before each test >> ring_test, ib_test in r600.c this 2 functions are the own that might >> trigger the first GPU gart activities. > > Okay, I found the place in source that triggers this. It happens in the > function r600_ib_test. The interesting thing is that not the ib-command > itself is responsible but the fence that is emitted afterwards (proved > by removing the fence command, where the problem went away). > I don't know enough about the command semantics to make a guess what > goes wrong there. But maybe you GPU folks have an idea? > I can't think of anything off hand. It might be worth disabling the call to r600_ib_test() in r600_init() and then seeing if you get any errors when the fences are used later on when X starts or just at that point in the module load sequence. What's odd is that when you tested radeon.no_wb=1 you got the same behavior as that disables shadowing of fence writes to gpu gart mem, so it wouldn't be writing to memory in that case. Alex >>> >>> It might be the irq ring write that is faulty. >> >> That's disabled with no_wb=1 as well. >> >> Alex >> > > I mean the irq interrupt ring, i don't see this being disabled when no_wb=1 I meant the IH ring pointer writeback. The ih ring itself is still in gart memory. Alex > > Cheers, > Jerome >
Linux 2.6.39-rc3
On Mon, Apr 18, 2011 at 11:33 AM, Alex Deucher wrote: > On Mon, Apr 18, 2011 at 11:29 AM, Jerome Glisse wrote: >> On Mon, Apr 18, 2011 at 11:23 AM, Alex Deucher wrote: >>> On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel wrote: On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote: > If you want to go the printk way you can add printk before each test > ring_test, ib_test in r600.c this 2 functions are the own that might > trigger the first GPU gart activities. Okay, I found the place in source that triggers this. It happens in the function r600_ib_test. The interesting thing is that not the ib-command itself is responsible but the fence that is emitted afterwards (proved by removing the fence command, where the problem went away). I don't know enough about the command semantics to make a guess what goes wrong there. But maybe you GPU folks have an idea? >>> >>> I can't think of anything off hand. It might be worth disabling the >>> call to r600_ib_test() in r600_init() and then seeing if you get any >>> errors when the fences are used later on when X starts or just at that >>> point in the module load sequence. What's odd is that when you tested >>> radeon.no_wb=1 you got the same behavior as that disables shadowing of >>> fence writes to gpu gart mem, so it wouldn't be writing to memory in >>> that case. >>> >>> Alex >>> >> >> It might be the irq ring write that is faulty. > > That's disabled with no_wb=1 as well. > > Alex > I mean the irq interrupt ring, i don't see this being disabled when no_wb=1 Cheers, Jerome
Linux 2.6.39-rc3
On Mon, Apr 18, 2011 at 11:29 AM, Jerome Glisse wrote: > On Mon, Apr 18, 2011 at 11:23 AM, Alex Deucher wrote: >> On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel wrote: >>> On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote: >>> If you want to go the printk way you can add printk before each test ring_test, ib_test in r600.c this 2 functions are the own that might trigger the first GPU gart activities. >>> >>> Okay, I found the place in source that triggers this. It happens in the >>> function r600_ib_test. The interesting thing is that not the ib-command >>> itself is responsible but the fence that is emitted afterwards (proved >>> by removing the fence command, where the problem went away). >>> I don't know enough about the command semantics to make a guess what >>> goes wrong there. But maybe you GPU folks have an idea? >>> >> >> I can't think of anything off hand. It might be worth disabling the >> call to r600_ib_test() in r600_init() and then seeing if you get any >> errors when the fences are used later on when X starts or just at that >> point in the module load sequence. What's odd is that when you tested >> radeon.no_wb=1 you got the same behavior as that disables shadowing of >> fence writes to gpu gart mem, so it wouldn't be writing to memory in >> that case. >> >> Alex >> > > It might be the irq ring write that is faulty. That's disabled with no_wb=1 as well. Alex > > Cheers, > Jerome >
Linux 2.6.39-rc3
On Mon, Apr 18, 2011 at 11:23 AM, Alex Deucher wrote: > On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel wrote: >> On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote: >> >>> If you want to go the printk way you can add printk before each test >>> ring_test, ib_test in r600.c this 2 functions are the own that might >>> trigger the first GPU gart activities. >> >> Okay, I found the place in source that triggers this. It happens in the >> function r600_ib_test. The interesting thing is that not the ib-command >> itself is responsible but the fence that is emitted afterwards (proved >> by removing the fence command, where the problem went away). >> I don't know enough about the command semantics to make a guess what >> goes wrong there. But maybe you GPU folks have an idea? >> > > I can't think of anything off hand. It might be worth disabling the > call to r600_ib_test() in r600_init() and then seeing if you get any > errors when the fences are used later on when X starts or just at that > point in the module load sequence. What's odd is that when you tested > radeon.no_wb=1 you got the same behavior as that disables shadowing of > fence writes to gpu gart mem, so it wouldn't be writing to memory in > that case. > > Alex > It might be the irq ring write that is faulty. Cheers, Jerome
Linux 2.6.39-rc3
On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel wrote: > On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote: > >> If you want to go the printk way you can add printk before each test >> ring_test, ib_test in r600.c this 2 functions are the own that might >> trigger the first GPU gart activities. > > Okay, I found the place in source that triggers this. It happens in the > function r600_ib_test. The interesting thing is that not the ib-command > itself is responsible but the fence that is emitted afterwards (proved > by removing the fence command, where the problem went away). > I don't know enough about the command semantics to make a guess what > goes wrong there. But maybe you GPU folks have an idea? > I can't think of anything off hand. It might be worth disabling the call to r600_ib_test() in r600_init() and then seeing if you get any errors when the fences are used later on when X starts or just at that point in the module load sequence. What's odd is that when you tested radeon.no_wb=1 you got the same behavior, as that disables shadowing of fence writes to GPU GART memory, so it wouldn't be writing to memory in that case. Alex > Joerg
Linux 2.6.39-rc3
On Sun, Apr 17, 2011 at 10:09 AM, Joerg Roedel wrote: > On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote: > >> If you want to go the printk way you can add printk before each test >> ring_test, ib_test in r600.c this 2 functions are the own that might >> trigger the first GPU gart activities. > > Okay, I found the place in source that triggers this. It happens in the > function r600_ib_test. The interesting thing is that not the ib-command > itself is responsible but the fence that is emitted afterwards (proved > by removing the fence command, where the problem went away). > I don't know enough about the command semantics to make a guess what > goes wrong there. But maybe you GPU folks have an idea? > > Joerg > > I can't think of any theory. At that point the wb, irq ring, cp buffer & ib pool are all allocated and pinned into gtt, so they all have valid entries backed by a real page. Maybe the GART flush & update is seriously buggy, but I expect we would have been hurt sooner by such things. Maybe there is a bug in the hw... wouldn't be surprised. Will try to think of a crazy theory. Cheers, Jerome
Linux 2.6.39-rc3
On Sat, Apr 16, 2011 at 02:54:04PM -0400, Jerome Glisse wrote: > If you want to go the printk way you can add printk before each test > ring_test, ib_test in r600.c this 2 functions are the own that might > trigger the first GPU gart activities. Okay, I found the place in source that triggers this. It happens in the function r600_ib_test. The interesting thing is that not the ib-command itself is responsible but the fence that is emitted afterwards (proved by removing the fence command, where the problem went away). I don't know enough about the command semantics to make a guess what goes wrong there. But maybe you GPU folks have an idea? Joerg
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 12:11:28PM -0400, Jerome Glisse wrote: > Do you also got the write if you load radeon with radeon.no_wb=1 ? > I think at this address it's the wb page, or maybe the cp as wb likely > take only one page radeon.no_wb=1 makes no difference. The box still reboots. Joerg
Linux 2.6.39-rc3
On Sat, Apr 16, 2011 at 12:35 PM, Joerg Roedel wrote: > On Fri, Apr 15, 2011 at 12:11:28PM -0400, Jerome Glisse wrote: >> Do you also got the write if you load radeon with radeon.no_wb=1 ? >> I think at this address it's the wb page, or maybe the cp as wb likely >> take only one page > > radeon.no_wb=1 makes no difference. The box still reboots. > > Joerg > > If you want to go the printk way, you can add a printk before each test (ring_test, ib_test) in r600.c; these two functions are the ones that might trigger the first GPU GART activity. Cheers, Jerome
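The printk approach Jerome suggests can be sketched in userspace as follows. This is only a model of the idea, not driver code: `run_traced`, `struct init_step` and the two stub functions are hypothetical names, and `fprintf` stands in for `printk`. On a box that reboots, the interesting output is the last "before" line that made it to the log.

```c
#include <stdio.h>

/* One named initialisation step; a stand-in for calls such as the
 * ring test and IB test mentioned in the thread. */
struct init_step {
    const char *name;
    int (*run)(void);
};

/* Hypothetical stubs: the first step succeeds, the second one fails. */
static int stub_ring_test(void) { return 0; }
static int stub_ib_test(void)   { return -1; }

/*
 * Run the steps in order, logging before and after each one.
 * Returns the index of the first failing step, or -1 if all succeed.
 */
static int run_traced(const struct init_step *steps, int n)
{
    for (int i = 0; i < n; i++) {
        fprintf(stderr, "before %s\n", steps[i].name);
        fflush(stderr); /* make sure the line is out before the step runs */
        if (steps[i].run() != 0) {
            fprintf(stderr, "%s failed\n", steps[i].name);
            return i;
        }
        fprintf(stderr, "after %s\n", steps[i].name);
    }
    return -1;
}
```

With the two stubs above, a table of `{"ring_test", stub_ring_test}` followed by `{"ib_test", stub_ib_test}` reports index 1 as the failing step.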
Linux 2.6.39-rc3
* Joerg Roedel wrote: > On Fri, Apr 15, 2011 at 09:06:41PM +0200, Ingo Molnar wrote: > > > > * Alexandre Demers wrote: > > > > > On 11-04-15 10:27 AM, Joerg Roedel wrote: > > > > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote: > > > >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of > > > >> the other patches? > > > > Yes, apply it just on -rc3 without any other patch. > > > > > > > >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla? > > > >> This could be useful in the future. > > > > Cool, thanks > > > > > > > > > > > > Joerg > > > The patch was applied and tested. It looks fine, I'm able to boot > > > without problem. > > > > Joerg, mind submitting it with a changelog that includes everything we > > learned > > about this bug and all the Tested-by's in place? > > Looks like I am too late, it is already applied. But the changelog > contains a link to the korg-bugzilla which has all information too. So > the information is not lost. Yeah. In this case getting the fix into -rc4 in a timely manner looked more important than waiting for an updated changelog :-) Thanks, Ingo
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 12:18:02PM -0700, Yinghai Lu wrote: > On 04/15/2011 12:06 PM, Ingo Molnar wrote: > > > > Joerg, mind submitting it with a changelog that includes everything we > > learned about this bug and all the Tested-by's in place? > > > > Is anyone of the opinion that we should try to revert the allocation > > order/alignment changes in addition to this fix? > > We should figure out what is written to 0xa0001000 (main memory) by GPU > before internal GART is setup. > > Joerg, > can you insert some dump code in the drm/radeon code to find out which > function causes the problem? I am not a GPU expert, but I will see what I can find out. Joerg
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 09:06:41PM +0200, Ingo Molnar wrote: > > * Alexandre Demers wrote: > > > On 11-04-15 10:27 AM, Joerg Roedel wrote: > > > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote: > > >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of > > >> the other patches? > > > Yes, apply it just on -rc3 without any other patch. > > > > > >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla? > > >> This could be useful in the future. > > > Cool, thanks > > > > > > > > > Joerg > > The patch was applied and tested. It looks fine, I'm able to boot > > without problem. > > Joerg, mind submitting it with a changelog that includes everything we > learned > about this bug and all the Tested-by's in place? Looks like I am too late, it is already applied. But the changelog contains a link to the korg-bugzilla which has all information too. So the information is not lost. Joerg
Linux 2.6.39-rc3
* Alexandre Demers wrote: > On 11-04-15 10:27 AM, Joerg Roedel wrote: > > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote: > >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of > >> the other patches? > > Yes, apply it just on -rc3 without any other patch. > > > >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla? > >> This could be useful in the future. > > Cool, thanks > > > > > > Joerg > The patch was applied and tested. It looks fine, I'm able to boot > without problem. Joerg, mind submitting it with a changelog that includes everything we learned about this bug and all the Tested-by's in place? Is anyone of the opinion that we should try to revert the allocation order/alignment changes in addition to this fix? Thanks, Ingo
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote: > Ok, but how did the allocation changes start triggering this error in > v2.6.39-rc1? There must still be some layout specific thing here, right? > Do we understand the details of that as well? Well, thinking again about this, the GPU likely generated this DMA request before too (which has an address in the range configured for the GTT on the card), but nobody noticed because they just hit main memory. And with the allocation changes in 39-rc1 the GART aperture started to be on the same address as the GTT (in their respective address spaces) so that the DMA request hit the GART. This caused the MCE and the sync-flood. The open question is why the GPU generates a DMA request with an address that is configured as the GTT base (+1 page) on the card. Joerg
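Joerg's explanation comes down to a range check: a DMA address that used to land in ordinary memory now falls inside the GART aperture. A minimal sketch of that check follows. The helper name is hypothetical, the base 0xa0000000 is inferred from the reported address 0xa0001000 being the second page of the aperture, and the 64 MiB size used in the usage note is purely an assumed value for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* A contiguous address window, e.g. the northbridge GART aperture. */
struct aperture {
    uint64_t base;
    uint64_t size;
};

/* True if addr lies inside [base, base + size). */
static bool in_aperture(const struct aperture *ap, uint64_t addr)
{
    return addr >= ap->base && addr - ap->base < ap->size;
}
```

With an aperture of base 0xa0000000 and an assumed size of 64 MiB, the address 0xa0001000 is inside the window, while 0x9ffff000 is not and would simply hit main memory.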
Linux 2.6.39-rc3
On Thu, Apr 14, 2011 at 05:34:46PM -0400, Alex Deucher wrote: > On Thu, Apr 14, 2011 at 5:09 PM, Joerg Roedel wrote: > > On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote: > >> On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel wrote: > >> > And this makes a difference, with this change on-top of -rc3 the box boots > >> > fine. So there seems to be some dependency between the GART base and the GTT > >> > base even when they are in different address spaces. > >> > > >> > Alex, can you comment on this? > >> > >> As Dave said, they are completely different address spaces. You > >> could put the GPU aperture at 0 if you wanted (in fact we do on some > >> chips). Perhaps there's some strange interaction with the nb gart > >> since the nb gart on that chipset was designed to be used for graphics > >> and the rs780/880 can be configured to use an agp aperture. > >> Unfortunately, I'm not that familiar with the nb gart. > > > > Actually, the nb gart is part of the cpu. It is part of the cpu north > > bridge and can translate io and cpu accesses. In fact, it is a remapper > > of physical memory addresses. > > I know what it's for. In the IGP graphics chip is also part of the > north bridge, but it may not be related at all. > > > > > The problem seems to be related to specific gpu chips. On another > > notebook with an hd3000 card gtt and the nb gart aperture are both on > > 0xa000 too but the box works fine. I haven't tested with an hd5000 > > yet. The failing notebook has an hd4200 mobility. > > What exact model is the hd3000? Is it IGP GPU or a discrete GPU? If > it's an IGP, it's identical to the hd4200 programming-wise. BTW, first of all the other notebook had a different CPU (it's family 0fh and Joerg's is family 10h). So different CPUs, different GARTs, different issues ;-) (Furthermore, for CPU family 0fh reporting of GartTblWalk errors is already switched off in arch/x86/kernel/cpu/mcheck/mce.c.) Andreas
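For readers unfamiliar with the family numbers Andreas mentions, here is a small sketch of how the x86 family is decoded from CPUID leaf 1 EAX. The function name is hypothetical; the decoding rule (add the extended family only when the base family is 0xF) is the standard CPUID convention. Both 0fh and 10h parts report a base family of 0xF and differ only in the extended family field.

```c
#include <stdint.h>

/*
 * Decode the display family from CPUID.1:EAX.
 * Base family is bits 11:8, extended family is bits 27:20; the
 * extended field is added only when the base family is 0xF.
 */
static unsigned int cpuid_family(uint32_t eax)
{
    unsigned int base = (eax >> 8) & 0xf;
    unsigned int ext  = (eax >> 20) & 0xff;

    return base == 0xf ? base + ext : base;
}
```

A quirk restricted to family 10h would compare this value against 0x10, which is how the thread distinguishes the failing notebook from the family 0fh one.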
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote:

> Ok, but how did the allocation changes start triggering this error in
> v2.6.39-rc1? There must still be some layout specific thing here, right?
> Do we understand the details of that as well?

No, I must admit that I lack enough knowledge about the GPU hardware to
make a guess how this translation request happened. All I can tell is
the address that was reported in the MCE: it is 0xa0001000 (== the second
page of the GART aperture).

Maybe Alex can help here. Alex, may it be possible that the GPU
generates DMA requests in the GTT area before the GTT is activated (or
the activation is completed)? Or can you imagine any other reason?

	Joerg
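Editorial sketch: the "second page" remark is plain address arithmetic. Assuming 4 KiB pages and an aperture base of 0xa0000000 (the base is inferred from the statement above, not quoted from a register dump), the reported address maps to zero-based page index 1. The helper name aperture_page_index() is hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* 4 KiB pages, an assumption for this sketch */

/* Zero-based index of the page containing addr, counted from base. */
static uint64_t aperture_page_index(uint64_t addr, uint64_t base)
{
	return (addr - base) >> PAGE_SHIFT;
}
```

Calling aperture_page_index(0xa0001000, 0xa0000000) gives 1, i.e. the second page of the aperture.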
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 04:04:45PM +0200, Andreas Herrmann wrote:

> What about tagging this patch for stable/longterm releases?
>
> Potentially there are other cases where certain combinations of
> hardware(GPUs)/drivers/whatsoever might trigger a GartTlbWlkErr. If
> the BIOS doesn't follow the BKDG recommendation to mask these errors,
> the system will hang/reboot. Thus I think having this quirk in .32 and
> .38 (at least) is useful.

Right, that's certainly a good idea. The problem is not specific to
GPUs, any other device can trigger this too.

	Joerg
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:

> Ok, I'll test it today. Should I apply it on a clean rc3 without any of
> the other patches?

Yes, apply it just on -rc3 without any other patch.

> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
> This could be useful in the future.

Cool, thanks

	Joerg
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 03:11:52PM +0200, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
> > we definitely want to also understand the reason for things not
> > working, even if we do revert..
>
> Okay, here it is.
>
> After experimenting with different configurations for the north-bridge
> it turned out that a GART related MCE fires at the time the machine
> reboots. BIOSes configure the machine to sync-flood in that case which
> causes a reboot.
>
> After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
> errors can happen if devices (speculatively) access GART ranges mapped
> invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors
> at all. But unfortunately some BIOSes (including the one on my laptop)
> forget to do this.
>
> Below is a patch which disables these errors if the BIOS didn't do it.
> It fixes the problem on my site.
>
> Alexandre, can you try this patch on your machine too, please?
>
> Regards,
>
> 	Joerg
>
> From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001
> From: Joerg Roedel
> Date: Fri, 15 Apr 2011 14:47:40 +0200
> Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it
>
> This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
> the BIOS forgets to do it (or is just too old). Letting
> these errors enabled can cause a sync-flood on the CPU
> causing a reboot.
>
> This patch is the fix for
>
> 	https://bugzilla.kernel.org/show_bug.cgi?id=33012
>
> on my machine.
>
> Signed-off-by: Joerg Roedel

Joerg,

What about tagging this patch for stable/longterm releases?

Potentially there are other cases where certain combinations of
hardware(GPUs)/drivers/whatsoever might trigger a GartTlbWlkErr. If
the BIOS doesn't follow the BKDG recommendation to mask these errors,
the system will hang/reboot. Thus I think having this quirk in .32 and
.38 (at least) is useful.

Andreas
Linux 2.6.39-rc3
* Joerg Roedel wrote:

> On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
> > we definitely want to also understand the reason for things not
> > working, even if we do revert..
>
> Okay, here it is.
>
> After experimenting with different configurations for the north-bridge
> it turned out that a GART related MCE fires at the time the machine
> reboots. BIOSes configure the machine to sync-flood in that case which
> causes a reboot.
>
> After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
> errors can happen if devices (speculatively) access GART ranges mapped
> invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors
> at all. But unfortunately some BIOSes (including the one on my laptop)
> forget to do this.
>
> Below is a patch which disables these errors if the BIOS didn't do it.
> It fixes the problem on my site.

Ok, but how did the allocation changes start triggering this error in
v2.6.39-rc1? There must still be some layout specific thing here, right?
Do we understand the details of that as well?

Thanks,

	Ingo
Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:

> we definitely want to also understand the reason for things not
> working, even if we do revert..

Okay, here it is.

After experimenting with different configurations for the north-bridge
it turned out that a GART related MCE fires at the time the machine
reboots. BIOSes configure the machine to sync-flood in that case which
causes a reboot.

After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
errors can happen if devices (speculatively) access GART ranges mapped
invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors
at all. But unfortunately some BIOSes (including the one on my laptop)
forget to do this.

Below is a patch which disables these errors if the BIOS didn't do it.
It fixes the problem on my site.

Alexandre, can you try this patch on your machine too, please?

Regards,

	Joerg
Linux 2.6.39-rc3
On 11-04-15 10:27 AM, Joerg Roedel wrote:
> On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
>> Ok, I'll test it today. Should I apply it on a clean rc3 without any of
>> the other patches?
> Yes, apply it just on -rc3 without any other patch.
>
>> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
>> This could be useful in the future.
> Cool, thanks
>
> 	Joerg

The patch was applied and tested. It looks fine, I'm able to boot
without problem.

-- 
Alexandre Demers
Re: Linux 2.6.39-rc3
On 04/15/2011 12:18 PM, Yinghai Lu wrote:
> On 04/15/2011 12:06 PM, Ingo Molnar wrote:
>>
>> Joerg, mind submitting it with a changelog that includes everything we learned
>> about this bug and all the Tested-by's in place?
>>
>> Is anyone of the opinion that we should try to revert the allocation
>> order/alignment changes in addition to this fix?
>
> We should figure out what is written to 0xa0001000 (main memory) by GPU
> before internal GART is setup.
>
> Joerg,
> can you insert some dump code in the drm/radeon code to find out which
> function causes the problem?
>

Yes, I would like to make sure we don't just paper over a real bug (again).

I think we still should take Joerg's patch since it seems to be the
right thing to do anyway, but I do want to make sure we don't have a
memory-overwrite bug in the kernel that we're papering over.

	-hpa
Re: Linux 2.6.39-rc3
On 04/15/2011 12:06 PM, Ingo Molnar wrote:
>
> Joerg, mind submitting it with a changelog that includes everything we learned
> about this bug and all the Tested-by's in place?
>
> Is anyone of the opinion that we should try to revert the allocation
> order/alignment changes in addition to this fix?

We should figure out what is written to 0xa0001000 (main memory) by GPU
before internal GART is setup.

Joerg,
can you insert some dump code in the drm/radeon code to find out which
function causes the problem?

Thanks

Yinghai
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 11:46 AM, Joerg Roedel wrote:
> On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote:
>> Ok, but how did the allocation changes start triggering this error in
>> v2.6.39-rc1? There must still be some layout specific thing here, right?
>> Do we understand the details of that as well?
>
> Well, thinking again about this, the GPU likely generated this DMA
> request before too (which has an address in the range configured for the
> GTT on the card), but nobody noticed because they just hit main memory.
> And with the allocation changes in 39-rc1 the GART aperture started to
> be on the same address as the GTT (in their respective address spaces)
> so that the DMA request hit the GART. This caused the MCE and the
> sync-flood.
> The open question is why the GPU generates a DMA request with an address
> that is configured as the GTT base (+1 page) on the card.
>
> 	Joerg
>

Do you also get the write if you load radeon with radeon.no_wb=1?

I think at this address it's the wb page, or maybe the cp, as wb likely
takes only one page.

Cheers,
Jerome
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 10:33 AM, Joerg Roedel wrote:
> On Fri, Apr 15, 2011 at 03:16:50PM +0200, Ingo Molnar wrote:
>> Ok, but how did the allocation changes start triggering this error in
>> v2.6.39-rc1? There must still be some layout specific thing here, right?
>> Do we understand the details of that as well?
>
> No, I must admit that I lack enough knowledge about the GPU hardware to
> make a guess how this translation request happened. All I can tell is
> the address that was reported in the MCE: it is 0xa0001000 (== the second
> page of the GART aperture).
>
> Maybe Alex can help here. Alex, may it be possible that the GPU
> generates DMA requests in the GTT area before the GTT is activated (or
> the activation is completed)? Or can you imagine any other reason?

It shouldn't. The driver binds a dummy page to all entries in the
table at init time and whenever the actual pages are unbound.

Alex
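Editorial sketch: the dummy-page scheme Alex describes can be modelled in a few lines. This is a toy userspace model, not the radeon driver's code. The struct layout, the table size and the gtt_* function names are all hypothetical and chosen only to show the invariant that no entry is ever left pointing at nothing.

```c
#include <stddef.h>
#include <stdint.h>

#define GTT_ENTRIES 8	/* tiny table, for illustration only */

struct gtt {
	uint64_t dummy_page;		/* address every unused entry points at */
	uint64_t entry[GTT_ENTRIES];	/* one translation per GTT page */
};

/* At init time every entry is bound to the dummy page. */
static void gtt_init(struct gtt *g, uint64_t dummy_page)
{
	g->dummy_page = dummy_page;
	for (size_t i = 0; i < GTT_ENTRIES; i++)
		g->entry[i] = dummy_page;
}

/* Binding points an entry at a real page. */
static void gtt_bind(struct gtt *g, size_t idx, uint64_t page)
{
	g->entry[idx] = page;
}

/* Unbinding restores the dummy page instead of leaving a hole. */
static void gtt_unbind(struct gtt *g, size_t idx)
{
	g->entry[idx] = g->dummy_page;
}
```

Under this model an access to an unbound GTT page is redirected to the dummy page rather than faulting, which is why a stray request in the GTT range is surprising.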
Re: Linux 2.6.39-rc3
* Alexandre Demers wrote:

> On 11-04-15 10:27 AM, Joerg Roedel wrote:
> > On Fri, Apr 15, 2011 at 10:16:59AM -0400, Alexandre Demers wrote:
> >> Ok, I'll test it today. Should I apply it on a clean rc3 without any of
> >> the other patches?
> > Yes, apply it just on -rc3 without any other patch.
> >
> >> BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
> >> This could be useful in the future.
> > Cool, thanks
> >
> > 	Joerg
>
> The patch was applied and tested. It looks fine, I'm able to boot
> without problem.

Joerg, mind submitting it with a changelog that includes everything we learned
about this bug and all the Tested-by's in place?

Is anyone of the opinion that we should try to revert the allocation
order/alignment changes in addition to this fix?

Thanks,

	Ingo
Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 10:26:34AM +0200, Michel Dänzer wrote:
> On Thu, 2011-04-14 at 23:09 +0200, Joerg Roedel wrote:
> > On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
> > > On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel wrote:
> > > > And this makes a difference, with this change on-top of -rc3 the box boots
> > > > fine. So there seems to be some dependency between the GART base and the GTT
> > > > base even when they are in different address spaces.
> > > >
> > > > Alex, can you comment on this?
> > >
> > > As Dave said, they are completely different address spaces. You
> > > could put the GPU aperture at 0 if you wanted (in fact we do on some
> > > chips). Perhaps there's some strange interaction with the nb gart
> > > since the nb gart on that chipset was designed to be used for graphics
> > > and the rs780/880 can be configured to use an agp aperture.
> > > Unfortunately, I'm not that familiar with the nb gart.
> >
> > Actually, the nb gart is part of the cpu. It is part of the cpu north
> > bridge and can translate io and cpu accesses. In fact, it is a remapper
> > of physical memory addresses.
> >
> > The problem seems to be related to specific gpu chips. On another
> > notebook with an hd3000 card gtt and the nb gart aperture are both on
> > 0xa0000000 too but the box works fine.
>
> Wasn't the working theory that the problem occurs if those two values
> aren't the same?

Yes it is, but this doesn't seem to be problematic on all radeon GPU
chips.

	Joerg
Linux 2.6.39-rc3
On Thu, 2011-04-14 at 23:09 +0200, Joerg Roedel wrote:
> On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
> > On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel wrote:
> > > And this makes a difference, with this change on-top of -rc3 the box boots
> > > fine. So there seems to be some dependency between the GART base and the GTT
> > > base even when they are in different address spaces.
> > >
> > > Alex, can you comment on this?
> >
> > As Dave said, they are completely different address spaces. You
> > could put the GPU aperture at 0 if you wanted (in fact we do on some
> > chips). Perhaps there's some strange interaction with the nb gart
> > since the nb gart on that chipset was designed to be used for graphics
> > and the rs780/880 can be configured to use an agp aperture.
> > Unfortunately, I'm not that familiar with the nb gart.
>
> Actually, the nb gart is part of the cpu. It is part of the cpu north
> bridge and can translate io and cpu accesses. In fact, it is a remapper
> of physical memory addresses.
>
> The problem seems to be related to specific gpu chips. On another
> notebook with an hd3000 card gtt and the nb gart aperture are both on
> 0xa0000000 too but the box works fine.

Wasn't the working theory that the problem occurs if those two values
aren't the same?

-- 
Earthling Michel Dänzer           |                http://www.vmware.com
Libre software enthusiast         |          Debian, X and DRI developer
Linux 2.6.39-rc3
On 11-04-15 09:11 AM, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
>> we definitely want to also understand the reason for things not
>> working, even if we do revert..
> Okay, here it is.
>
> After experimenting with different configurations for the north-bridge
> it turned out that a GART related MCE fires at the time the machine
> reboots. BIOSes configure the machine to sync-flood in that case which
> causes a reboot.
>
> After decoding the MCE it turned out to be a GART TBL Wlk Error. Such
> errors can happen if devices (speculatively) access GART ranges mapped
> invalid. The AMD BKDG for Fam10h CPUs recommends to disable these errors
> at all. But unfortunately some BIOSes (including the one on my laptop)
> forget to do this.
>
> Below is a patch which disables these errors if the BIOS didn't do it.
> It fixes the problem on my site.
>
> Alexandre, can you try this patch on your machine too, please?
>
> Regards,
>
> 	Joerg
>
> From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001
> From: Joerg Roedel
> Date: Fri, 15 Apr 2011 14:47:40 +0200
> Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it
>
> This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
> the BIOS forgets to do it (or is just too old). Letting
> these errors enabled can cause a sync-flood on the CPU
> causing a reboot.
>
> This patch is the fix for
>
> 	https://bugzilla.kernel.org/show_bug.cgi?id=33012
>
> on my machine.
>
> Signed-off-by: Joerg Roedel
> ---
>  arch/x86/include/asm/msr-index.h |    4 ++++
>  arch/x86/kernel/cpu/amd.c        |   19 +++++++++++++++++++
>  2 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index fd5a1f3..3cce714 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -96,11 +96,15 @@
>  #define MSR_IA32_MC0_ADDR		0x00000402
>  #define MSR_IA32_MC0_MISC		0x00000403
>  
> +#define MSR_AMD64_MC0_MASK		0xc0010044
> +
>  #define MSR_IA32_MCx_CTL(x)		(MSR_IA32_MC0_CTL + 4*(x))
>  #define MSR_IA32_MCx_STATUS(x)		(MSR_IA32_MC0_STATUS + 4*(x))
>  #define MSR_IA32_MCx_ADDR(x)		(MSR_IA32_MC0_ADDR + 4*(x))
>  #define MSR_IA32_MCx_MISC(x)		(MSR_IA32_MC0_MISC + 4*(x))
>  
> +#define MSR_AMD64_MCx_MASK(x)		(MSR_AMD64_MC0_MASK + (x))
> +
>  /* These are consecutive and not in the normal 4er MCE bank block */
>  #define MSR_IA32_MC0_CTL2		0x00000280
>  #define MSR_IA32_MCx_CTL2(x)		(MSR_IA32_MC0_CTL2 + (x))
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 3ecece0..3532d3b 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -615,6 +615,25 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
>  	/* As a rule processors have APIC timer running in deep C states */
>  	if (c->x86 >= 0xf && !cpu_has_amd_erratum(amd_erratum_400))
>  		set_cpu_cap(c, X86_FEATURE_ARAT);
> +
> +	/*
> +	 * Disable GART TLB Walk Errors on Fam10h. We do this here
> +	 * because this is always needed when GART is enabled, even in a
> +	 * kernel which has no MCE support built in.
> +	 */
> +	if (c->x86 == 0x10) {
> +		/*
> +		 * BIOS should disable GartTlbWlk Errors themself. If
> +		 * it doesn't do it here as suggested by the BKDG.
> +		 *
> +		 * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012
> +		 */
> +		u64 mask;
> +
> +		rdmsrl(MSR_AMD64_MCx_MASK(4), mask);
> +		mask |= (1 << 10);
> +		wrmsrl(MSR_AMD64_MCx_MASK(4), mask);
> +	}
>  }
>  
>  #ifdef CONFIG_X86_32

Ok, I'll test it today. Should I apply it on a clean rc3 without any of
the other patches?

BTW, may I suggest adding the info under bug 33012 in kernel bugzilla?
This could be useful in the future.

I'll keep you up to date.

-- 
Alexandre Demers
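Editorial sketch: the core of the quoted patch is a read-modify-write that sets bit 10 in the mask MSR of MCE bank 4. The userspace model below mirrors only the arithmetic and cannot touch real MSRs. The two macros are copied from the patch, while GART_TLB_WALK_ERR_BIT and mask_gart_tlb_walk_err() are names invented here for illustration.

```c
#include <stdint.h>

/* Macros as introduced by the patch quoted above. */
#define MSR_AMD64_MC0_MASK	0xc0010044u
#define MSR_AMD64_MCx_MASK(x)	(MSR_AMD64_MC0_MASK + (x))

/* Bit position the patch sets; the symbolic name is hypothetical. */
#define GART_TLB_WALK_ERR_BIT	10

/* Return the mask value with GART TLB walk error reporting masked.
 * All other bits are preserved, and the operation is idempotent. */
static uint64_t mask_gart_tlb_walk_err(uint64_t mask)
{
	return mask | (1ULL << GART_TLB_WALK_ERR_BIT);
}
```

MSR_AMD64_MCx_MASK(4) evaluates to 0xc0010048, and because the helper only ORs in one bit, applying it on a machine whose BIOS already masked the error leaves the value unchanged.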
Linux 2.6.39-rc3
On Thu, Apr 14, 2011 at 05:34:46PM -0400, Alex Deucher wrote:
> On Thu, Apr 14, 2011 at 5:09 PM, Joerg Roedel wrote:
> > Actually, the nb gart is part of the cpu. It is part of the cpu north
> > bridge and can translate io and cpu accesses. In fact, it is a remapper
> > of physical memory addresses.
>
> I know what it's for. In the IGP the graphics chip is also part of the
> north bridge, but it may not be related at all.

Okay, just wanted to make clear that it is part of the CPU and not of
the chipset :)

> > The problem seems to be related to specific gpu chips. On another
> > notebook with an hd3000 card gtt and the nb gart aperture are both on
> > 0xa0000000 too but the box works fine. I haven't tested with an hd5000
> > yet. The failing notebook has an hd4200 mobility.
>
> What exact model is the hd3000? Is it IGP GPU or a discrete GPU? If
> it's an IGP, it's identical to the hd4200 programming-wise.

It is an IGP card, an "ATI Technologies Inc RS780M/RS780MN [Radeon HD
3200 Graphics]" according to lspci.

> > Btw. what happens if the gpu accesses an unmapped address in the gtt
> > range?
>
> It's redirected to a dummy page.

So there should be no issue either; this is a very weird bug.

	Joerg
Re: Linux 2.6.39-rc3
* Joerg Roedel wrote:

> On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
> > we definitely want to also understand the reason for things not
> > working, even if we do revert..
>
> Okay, here it is.
>
> After experimenting with different configurations for the north-bridge
> it turned out that a GART related MCE fires at the time the machine
> reboots. BIOSes configure the machine to sync-flood in that case which
> causes a reboot.
>
> After decoding the MCE it turned out to be a GART TLB Walk Error. Such
> errors can happen if devices (speculatively) access GART ranges mapped
> invalid. The AMD BKDG for Fam10h CPUs recommends disabling these errors
> altogether. But unfortunately some BIOSes (including the one on my laptop)
> forget to do this.
>
> Below is a patch which disables these errors if the BIOS didn't do it.
> It fixes the problem on my side.

Ok, but how did the allocation changes start triggering this error in
v2.6.39-rc1? There must still be some layout-specific thing here, right?
Do we understand the details of that as well?

Thanks,

	Ingo
Re: Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
> we definitely want to also understand the reason for things not
> working, even if we do revert..

Okay, here it is.

After experimenting with different configurations for the north-bridge
it turned out that a GART related MCE fires at the time the machine
reboots. BIOSes configure the machine to sync-flood in that case which
causes a reboot.

After decoding the MCE it turned out to be a GART TLB Walk Error. Such
errors can happen if devices (speculatively) access GART ranges mapped
invalid. The AMD BKDG for Fam10h CPUs recommends disabling these errors
altogether. But unfortunately some BIOSes (including the one on my laptop)
forget to do this.

Below is a patch which disables these errors if the BIOS didn't do it.
It fixes the problem on my side.

Alexandre, can you try this patch on your machine too, please?

Regards,

	Joerg

From aaacff8db50b6ed4345e337ecbe53e505699c7e5 Mon Sep 17 00:00:00 2001
From: Joerg Roedel
Date: Fri, 15 Apr 2011 14:47:40 +0200
Subject: [PATCH] x86/amd: Disable GartTlbWlkErr when BIOS forgets it

This patch disables GartTlbWlk errors on AMD Fam10h CPUs if
the BIOS forgets to do it (or is just too old). Leaving
these errors enabled can cause a sync-flood on the CPU
causing a reboot.

This patch is the fix for

	https://bugzilla.kernel.org/show_bug.cgi?id=33012

on my machine.

Signed-off-by: Joerg Roedel
---
 arch/x86/include/asm/msr-index.h |  4
 arch/x86/kernel/cpu/amd.c        | 19 +++
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index fd5a1f3..3cce714 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -96,11 +96,15 @@
 #define MSR_IA32_MC0_ADDR	0x0402
 #define MSR_IA32_MC0_MISC	0x0403
 
+#define MSR_AMD64_MC0_MASK	0xc0010044
+
 #define MSR_IA32_MCx_CTL(x)	(MSR_IA32_MC0_CTL + 4*(x))
 #define MSR_IA32_MCx_STATUS(x)	(MSR_IA32_MC0_STATUS + 4*(x))
 #define MSR_IA32_MCx_ADDR(x)	(MSR_IA32_MC0_ADDR + 4*(x))
 #define MSR_IA32_MCx_MISC(x)	(MSR_IA32_MC0_MISC + 4*(x))
 
+#define MSR_AMD64_MCx_MASK(x)	(MSR_AMD64_MC0_MASK + (x))
+
 /* These are consecutive and not in the normal 4er MCE bank block */
 #define MSR_IA32_MC0_CTL2	0x0280
 #define MSR_IA32_MCx_CTL2(x)	(MSR_IA32_MC0_CTL2 + (x))
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 3ecece0..3532d3b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -615,6 +615,25 @@ static void __cpuinit init_amd(struct cpuinfo_x86 *c)
 	/* As a rule processors have APIC timer running in deep C states */
 	if (c->x86 >= 0xf && !cpu_has_amd_erratum(amd_erratum_400))
 		set_cpu_cap(c, X86_FEATURE_ARAT);
+
+	/*
+	 * Disable GART TLB Walk Errors on Fam10h. We do this here
+	 * because this is always needed when GART is enabled, even in a
+	 * kernel which has no MCE support built in.
+	 */
+	if (c->x86 == 0x10) {
+		/*
+		 * BIOS should disable GartTlbWlk Errors themself. If
+		 * it doesn't do it here as suggested by the BKDG.
+		 *
+		 * Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=33012
+		 */
+		u64 mask;
+
+		rdmsrl(MSR_AMD64_MCx_MASK(4), mask);
+		mask |= (1 << 10);
+		wrmsrl(MSR_AMD64_MCx_MASK(4), mask);
+	}
 }
 
 #ifdef CONFIG_X86_32
-- 
1.7.1
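The register arithmetic in the patch can be tried out in user space. The following is only a sketch: the two MSR macros and the bit number are copied from the patch, while the helper names mask_gart_tlb_walk_errors() and gart_tlb_walk_errors_masked() are made up for illustration, and nothing here touches a real MSR (that needs rdmsrl()/wrmsrl() inside the kernel).

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the patch: base of the per-bank MCA mask registers. */
#define MSR_AMD64_MC0_MASK	0xc0010044u
#define MSR_AMD64_MCx_MASK(x)	(MSR_AMD64_MC0_MASK + (x))

/* Bit number the patch sets in the bank 4 mask register. */
#define GART_TLB_WLK_ERR_BIT	10

/* Unlike the CTL/STATUS/ADDR/MISC registers, the mask MSRs are
 * consecutive (stride 1), so bank 4 sits four MSRs after bank 0. */
static_assert(MSR_AMD64_MCx_MASK(4) == 0xc0010048u, "bank 4 mask MSR");

/* Hypothetical helper: the value the patch would write back. */
uint64_t mask_gart_tlb_walk_errors(uint64_t mask)
{
	return mask | (UINT64_C(1) << GART_TLB_WLK_ERR_BIT);
}

/* Hypothetical helper: did the BIOS already mask the error? */
int gart_tlb_walk_errors_masked(uint64_t mask)
{
	return (int)((mask >> GART_TLB_WLK_ERR_BIT) & 1);
}
```

Because the update is a plain OR, it is idempotent: applying it to a value where the BIOS already set bit 10 changes nothing, which is presumably why the patch does not check the bit before writing.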
Re: Linux 2.6.39-rc3
On Fri, Apr 15, 2011 at 10:26:34AM +0200, Michel Dänzer wrote:
> On Thu, 2011-04-14 at 23:09 +0200, Joerg Roedel wrote:
> > On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
> > > On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel wrote:
> > > > And this makes a difference, with this change on-top of -rc3 the
> > > > box boots fine. So there seems to be some dependency between the
> > > > GART base and the GTT base even when they are in different address
> > > > spaces.
> > > >
> > > > Alex, can you comment on this?
> > >
> > > As Dave said, they are completely different address spaces. You
> > > could put the GPU aperture at 0 if you wanted (in fact we do on some
> > > chips). Perhaps there's some strange interaction with the nb gart
> > > since the nb gart on that chipset was designed to be used for graphics
> > > and the rs780/880 can be configured to use an agp aperture.
> > > Unfortunately, I'm not that familiar with the nb gart.
> >
> > Actually, the nb gart is part of the cpu. It is part of the cpu north
> > bridge and can translate io and cpu accesses. In fact, it is a remapper
> > of physical memory addresses.
> >
> > The problem seems to be related to specific gpu chips. On another
> > notebook with an hd3000 card gtt and the nb gart aperture are both on
> > 0xa000 too but the box works fine.
>
> Wasn't the working theory that the problem occurs if those two values
> aren't the same?

Yes it is, but this doesn't seem to be problematic on all radeon GPU
chips.

	Joerg
Re: Linux 2.6.39-rc3
On Thu, 2011-04-14 at 23:09 +0200, Joerg Roedel wrote:
> On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
> > On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel wrote:
> > > And this makes a difference, with this change on-top of -rc3 the box
> > > boots fine. So there seems to be some dependency between the GART
> > > base and the GTT base even when they are in different address spaces.
> > >
> > > Alex, can you comment on this?
> >
> > As Dave said, they are completely different address spaces. You
> > could put the GPU aperture at 0 if you wanted (in fact we do on some
> > chips). Perhaps there's some strange interaction with the nb gart
> > since the nb gart on that chipset was designed to be used for graphics
> > and the rs780/880 can be configured to use an agp aperture.
> > Unfortunately, I'm not that familiar with the nb gart.
>
> Actually, the nb gart is part of the cpu. It is part of the cpu north
> bridge and can translate io and cpu accesses. In fact, it is a remapper
> of physical memory addresses.
>
> The problem seems to be related to specific gpu chips. On another
> notebook with an hd3000 card gtt and the nb gart aperture are both on
> 0xa000 too but the box works fine.

Wasn't the working theory that the problem occurs if those two values
aren't the same?

-- 
Earthling Michel Dänzer           |                http://www.vmware.com
Libre software enthusiast         |          Debian, X and DRI developer
Re: Linux 2.6.39-rc3
On Thu, Apr 14, 2011 at 05:34:46PM -0400, Alex Deucher wrote:
> On Thu, Apr 14, 2011 at 5:09 PM, Joerg Roedel wrote:
> > Actually, the nb gart is part of the cpu. It is part of the cpu north
> > bridge and can translate io and cpu accesses. In fact, it is a remapper
> > of physical memory addresses.
>
> I know what it's for. The IGP graphics chip is also part of the
> north bridge, but it may not be related at all.

Okay, just wanted to make clear that it is part of the CPU and not of
the chipset :)

> > The problem seems to be related to specific gpu chips. On another
> > notebook with an hd3000 card gtt and the nb gart aperture are both on
> > 0xa000 too but the box works fine. I haven't tested with an hd5000
> > yet. The failing notebook has an hd4200 mobility.
>
> What exact model is the hd3000? Is it IGP GPU or a discrete GPU? If
> it's an IGP, it's identical to the hd4200 programming-wise.

It is an IGP card, an "ATI Technologies Inc RS780M/RS780MN [Radeon HD
3200 Graphics]" according to lspci.

> > Btw. what happens if the gpu accesses an unmapped address in the gtt
> > range?
>
> It's redirected to a dummy page.

So there should be no issue there either; this is a very weird bug.

	Joerg
Linux 2.6.39-rc3
On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
> On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel wrote:
> > And this makes a difference, with this change on-top of -rc3 the box
> > boots fine. So there seems to be some dependency between the GART base
> > and the GTT base even when they are in different address spaces.
> >
> > Alex, can you comment on this?
>
> As Dave said, they are completely different address spaces. You
> could put the GPU aperture at 0 if you wanted (in fact we do on some
> chips). Perhaps there's some strange interaction with the nb gart
> since the nb gart on that chipset was designed to be used for graphics
> and the rs780/880 can be configured to use an agp aperture.
> Unfortunately, I'm not that familiar with the nb gart.

Actually, the nb gart is part of the cpu. It is part of the cpu north
bridge and can translate io and cpu accesses. In fact, it is a remapper
of physical memory addresses.

The problem seems to be related to specific gpu chips. On another
notebook with an hd3000 card gtt and the nb gart aperture are both on
0xa000 too but the box works fine. I haven't tested with an hd5000
yet. The failing notebook has an hd4200 mobility.

Btw. what happens if the gpu accesses an unmapped address in the gtt
range?

Regards,

	Joerg
Linux 2.6.39-rc3
On Thu, Apr 14, 2011 at 6:56 PM, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
>> On 04/13/2011 12:14 PM, Yinghai Lu wrote:
>> >
>> > so looks bios program wrong address to the radeon card?
>> >
>>
>> Okay, staring at this, it definitely seems toxic to overlay the GART
>> over memory areas reserved by the BIOS. If I were to guess, I would say
>> that the problem here seems to be that the kernel thinks it is
>> overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
>> size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
>>
>> Alex D., could you comment on the "num cpu pages" bit?
>
> Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):
>
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
>  			mc->gtt_size = size_bf;
>  		}
>  		mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
> +		if (mc->gtt_start == 0xa000)
> +			mc->gtt_start = 0x8000;
>  	} else {
>  		if (mc->gtt_size > size_af) {
>  			dev_warn(rdev->dev, "limiting GTT\n");
>
> And this makes a difference, with this change on-top of -rc3 the box boots
> fine. So there seems to be some dependency between the GART base and the GTT
> base even when they are in different address spaces.
>
> Alex, can you comment on this?

Weird. Either it's a hw bug, or some access to the GTT is leaking out
before things are set up properly. I think the RS780/880 docs are on the
website, but generally the address spaces are completely separate, so
anything getting through is very unusual.

Dave.
Linux 2.6.39-rc3
On Thu, Apr 14, 2011 at 5:09 PM, Joerg Roedel wrote:
> On Thu, Apr 14, 2011 at 10:28:43AM -0400, Alex Deucher wrote:
>> On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel wrote:
>> > And this makes a difference, with this change on-top of -rc3 the box
>> > boots fine. So there seems to be some dependency between the GART
>> > base and the GTT base even when they are in different address spaces.
>> >
>> > Alex, can you comment on this?
>>
>> As Dave said, they are completely different address spaces. You
>> could put the GPU aperture at 0 if you wanted (in fact we do on some
>> chips). Perhaps there's some strange interaction with the nb gart
>> since the nb gart on that chipset was designed to be used for graphics
>> and the rs780/880 can be configured to use an agp aperture.
>> Unfortunately, I'm not that familiar with the nb gart.
>
> Actually, the nb gart is part of the cpu. It is part of the cpu north
> bridge and can translate io and cpu accesses. In fact, it is a remapper
> of physical memory addresses.

I know what it's for. The IGP graphics chip is also part of the
north bridge, but it may not be related at all.

>
> The problem seems to be related to specific gpu chips. On another
> notebook with an hd3000 card gtt and the nb gart aperture are both on
> 0xa000 too but the box works fine. I haven't tested with an hd5000
> yet. The failing notebook has an hd4200 mobility.

What exact model is the hd3000? Is it IGP GPU or a discrete GPU? If
it's an IGP, it's identical to the hd4200 programming-wise.

>
> Btw. what happens if the gpu accesses an unmapped address in the gtt
> range?

It's redirected to a dummy page.

Alex
Linux 2.6.39-rc3
Hello,

On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
> On Wednesday, April 13, 2011, Linus Torvalds wrote:
> > On Wednesday, April 13, 2011, H. Peter Anvin wrote:
> >>
> >> Yes. However, even if we *do* revert (and the time is running short on
> >> not reverting) I would like to understand this particular one, simply
> >> because I think it may very well be a problem that is manifesting itself
> >> in other ways on other systems.
>
> sorry, fingerfart. Anyway, I agree 100%.
>
> we definitely want to also understand the reason for things not
> working, even if we do revert..

There were (and still are) places where memblock callers implemented
ad-hoc top-down allocation by stepping down start limit until
allocation succeeds. Several of them have been removed since top-down
became the default behavior, so simply reverting the commit is likely
to cause subtle issues. Maybe the best approach is introducing
@topdown parameter and use it selectively for pure memory allocations.

Thanks.

-- 
tejun
Linux 2.6.39-rc3
On Wed, 2011-04-13 at 18:58 -0700, H. Peter Anvin wrote:
> On 04/13/2011 12:14 PM, Yinghai Lu wrote:
> >
> > so those two patches uncover some problems.
> >
> > [0.00] Checking aperture...
> > [0.00] No AGP bridge found
> > [0.00] Node 0: aperture @ a000 size 32 MB
> > [0.00] Aperture pointing to e820 RAM. Ignoring.
> > [0.00] Your BIOS doesn't leave a aperture memory hole
> > [0.00] Please enable the IOMMU option in the BIOS setup
> > [0.00] This costs you 64 MB of RAM
> > [0.00] memblock_x86_reserve_range: [0xa000-0xa3ff] aperture64
> > [0.00] Mapping aperture over 65536 KB of RAM @ a000
> >
> > so kernel try to reallocate aperture. because BIOS allocated is pointed to
> > RAM or size is too small.
> >
> > but your radeon does use [0xa000, 0xbfff)
> >
> > [4.281993] radeon :01:05.0: VRAM: 320M 0xC000 - 0xD3FF (320M used)
> > [4.290672] radeon :01:05.0: GTT: 512M 0xA000 - 0xBFFF
> > [4.298550] [drm] Detected VRAM RAM=320M, BAR=256M
> > [4.309857] [drm] RAM width 32bits DDR
> > [4.313748] [TTM] Zone kernel: Available graphics memory: 1896524 kiB.
> > [4.320379] [TTM] Initializing pool allocator.
> > [4.324948] [drm] radeon: 320M of VRAM memory ready
> > [4.329832] [drm] radeon: 512M of GTT memory ready.
> >
> > and the one seems working:
> >
> > [0.00] Checking aperture...
> > [0.00] No AGP bridge found
> > [0.00] Node 0: aperture @ a000 size 32 MB
> > [0.00] Aperture pointing to e820 RAM. Ignoring.
> > [0.00] Your BIOS doesn't leave a aperture memory hole
> > [0.00] Please enable the IOMMU option in the BIOS setup
> > [0.00] This costs you 64 MB of RAM
> > [0.00] memblock_x86_reserve_range: [0x8000-0x83ff] aperture64
> > [0.00] Mapping aperture over 65536 KB of RAM @ 8000
> > [0.00] memblock_x86_reserve_range: [0xacb6bdc0-0xacb6bddf] BOOTMEM
> >
> > will use different position...
> >
> > [4.250159] radeon :01:05.0: VRAM: 320M 0xC000 - 0xD3FF (320M used)
> > [4.258830] radeon :01:05.0: GTT: 512M 0xA000 - 0xBFFF
> > [4.266742] [drm] Detected VRAM RAM=320M, BAR=256M
> > [4.271549] [drm] RAM width 32bits DDR
> > [4.275435] [TTM] Zone kernel: Available graphics memory: 1896526 kiB.
> > [4.282066] [TTM] Initializing pool allocator.
> > [4.282085] usb 7-2: new full speed USB device number 2 using ohci_hcd
> > [4.293076] [drm] radeon: 320M of VRAM memory ready
> > [4.298277] [drm] radeon: 512M of GTT memory ready.
> > [4.303218] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
> > [4.309854] [drm] Driver supports precise vblank timestamp query.
> > [4.315970] [drm] radeon: irq initialized.
> > [4.320094] [drm] GART: num cpu pages 131072, num gpu pages 131072
> >
> > So question is why radeon is using the address [0xa000 - 0xc00],
> > and in E820 it is RAM
> >
> > [0.00] BIOS-e820: 0010 - acb8d000 (usable)
> > [0.00] BIOS-e820: acb8d000 - acb8f000 (reserved)
> > [0.00] BIOS-e820: acb8f000 - afce9000 (usable)
> > [0.00] BIOS-e820: afce9000 - afd21000 (reserved)
> > [0.00] BIOS-e820: afd21000 - afd4f000 (usable)
> > [0.00] BIOS-e820: afd4f000 - afdcf000 (reserved)
> > [0.00] BIOS-e820: afdcf000 - afecf000 (ACPI NVS)
> > [0.00] BIOS-e820: afecf000 - afeff000 (ACPI data)
> > [0.00] BIOS-e820: afeff000 - aff0 (usable)
> >
> > so looks bios program wrong address to the radeon card?
> >
>
> Okay, staring at this, it definitely seems toxic to overlay the GART
> over memory areas reserved by the BIOS. If I were to guess, I would say
> that the problem here seems to be that the kernel thinks it is
> overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
> size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
>
> Alex D., could you comment on the "num cpu pages" bit?

These are not CPU addresses. I think we've stated that already. Not the
droids.

The num cpu pages is how many CPU pages would be needed to fill the GPU
GTT, for those crazy cases where CPU pagesize != GPU pagesize.

Dave.
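To make the "num cpu pages" arithmetic concrete, here is a small user-space sketch. It assumes a 4 KiB GPU page (the GPU_PAGE_SIZE constant and both function names are illustrative and not taken from the driver source); the 512 MiB GTT size and the count of 131072 are the figures from the log quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: the GPU maps the GTT in 4 KiB pages. */
#define GPU_PAGE_SIZE	4096u

/* Number of GPU pages needed to cover the whole GTT. */
uint64_t gtt_num_gpu_pages(uint64_t gtt_size)
{
	return gtt_size / GPU_PAGE_SIZE;
}

/* Number of CPU pages needed to back the whole GTT. This differs from
 * the GPU page count only when the CPU page size is not 4 KiB. */
uint64_t gtt_num_cpu_pages(uint64_t gtt_size, uint64_t cpu_page_size)
{
	assert(cpu_page_size != 0);
	return gtt_size / cpu_page_size;
}
```

With a 512 MiB GTT and 4 KiB CPU pages both counts come out as 131072, matching the log line. The number describes how much system memory the GTT could reference, not a range of CPU physical addresses.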
Linux 2.6.39-rc3
On Thu, Apr 14, 2011 at 01:03:37PM +0900, Tejun Heo wrote:
> Hello,
>
> On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote:
> > On Wednesday, April 13, 2011, Linus Torvalds wrote:
> > > On Wednesday, April 13, 2011, H. Peter Anvin wrote:
> > >>
> > >> Yes. However, even if we *do* revert (and the time is running short on
> > >> not reverting) I would like to understand this particular one, simply
> > >> because I think it may very well be a problem that is manifesting itself
> > >> in other ways on other systems.
> >
> > sorry, fingerfart. Anyway, I agree 100%.
> >
> > we definitely want to also understand the reason for things not
> > working, even if we do revert..
>
> There were (and still are) places where memblock callers implemented
> ad-hoc top-down allocation by stepping down start limit until
> allocation succeeds. Several of them have been removed since top-down
> became the default behavior, so simply reverting the commit is likely
> to cause subtle issues. Maybe the best approach is introducing
> @topdown parameter and use it selectively for pure memory allocations.

Wouldn't it be better to provide a separate memblock allocation
function which operates top-down and use this one in the places that
need it? This way it wouldn't break code that relies on bottom-up.

	Joerg
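The difference between the two allocation orders discussed here can be shown with a toy model. This sketch is not memblock: it searches a single free range [start, end) instead of a region list, and the names find_bottom_up(), find_top_down() and ALLOC_FAIL are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define ALLOC_FAIL	UINT64_MAX

static int is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/* Lowest aligned address in [start, end) with room for size bytes. */
uint64_t find_bottom_up(uint64_t start, uint64_t end,
			uint64_t size, uint64_t align)
{
	uint64_t addr;

	assert(is_pow2(align));
	addr = (start + align - 1) & ~(align - 1);	/* round up */
	if (addr < start)				/* wrapped around */
		return ALLOC_FAIL;
	if (addr > end || end - addr < size)
		return ALLOC_FAIL;
	return addr;
}

/* Highest aligned address in [start, end) with room for size bytes. */
uint64_t find_top_down(uint64_t start, uint64_t end,
		       uint64_t size, uint64_t align)
{
	uint64_t addr;

	assert(is_pow2(align));
	if (end < size)
		return ALLOC_FAIL;
	addr = (end - size) & ~(align - 1);		/* round down */
	if (addr < start)
		return ALLOC_FAIL;
	return addr;
}
```

For the same free range the two functions return addresses at opposite ends, which is why a caller that silently depended on getting low addresses can change behavior when the default order flips. Keeping them as two separate entry points, as suggested in the message above, makes that dependency explicit at each call site.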
Linux 2.6.39-rc3
* Joerg Roedel wrote:

> On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
> > On 04/13/2011 12:14 PM, Yinghai Lu wrote:
> > >
> > > so looks bios program wrong address to the radeon card?
> > >
> >
> > Okay, staring at this, it definitely seems toxic to overlay the GART
> > over memory areas reserved by the BIOS. If I were to guess, I would say
> > that the problem here seems to be that the kernel thinks it is
> > overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
> > size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
> >
> > Alex D., could you comment on the "num cpu pages" bit?
>
> Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):
>
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
>  			mc->gtt_size = size_bf;
>  		}
>  		mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
> +		if (mc->gtt_start == 0xa000)
> +			mc->gtt_start = 0x8000;
>  	} else {
>  		if (mc->gtt_size > size_af) {
>  			dev_warn(rdev->dev, "limiting GTT\n");
>
> And this makes a difference, with this change on-top of -rc3 the box boots
> fine. So there seems to be some dependency between the GART base and the GTT
> base even when they are in different address spaces.
>
> Alex, can you comment on this?

I'd strongly suggest we revert back to the old and proven allocation
order, as long as it results in valid layouts. Even if we figure out this
particular GART/GTT assumption there might be a dozen others in other
types of hardware.

Thanks,

	Ingo
Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 03:31:09PM -0700, H. Peter Anvin wrote:
> On 04/13/2011 03:22 PM, Joerg Roedel wrote:
> > On Wed, Apr 13, 2011 at 03:01:10PM -0700, H. Peter Anvin wrote:
> >> On 04/13/2011 02:50 PM, Joerg Roedel wrote:
> >>> On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote:
> >>>> -addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20);
> >>>> +addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21);
> >>>
> >>> Btw, while looking at this code I wondered why the 512M goal is enforced
> >>> by the alignment. Start could be set to 512M instead and the alignment
> >>> can be aper_size as it should. Any reason for such a big alignment?
> >>>
> >>> Joerg
> >>>
> >>> P.S.: The box is still in the office, I will try this debug-patch
> >>> tomorrow.
> >>
> >> The only reason that I can think of is that the aperture itself can be
> >> huge, and perhaps 512 MiB is the biggest such known.
> >
> > Well, that would work as well by just using aper_size as alignment, the
> > aperture needs to be aligned on its size anyway. This code only runs
> > when Linux allocates the aperture itself and if I am not mistaken it
> > always uses 64MB when doing this.
>
> Yes, I would agree with that. The sane thing would be to set the base
> to whatever address needs to be guarded against (WHICH SHOULD BE
> MOTIVATED), and use aper_size as alignment, *unless* we are only using
> the initial portion of a much larger hardware structure that needs
> natural alignment (which isn't clear to me, I do know we sometimes use
> only a fraction of the GART, but that doesn't mean we need to
> naturally-align the entire thing, nor that 512 MiB is sufficient to do so.)

What's allocated here is the address space for the aperture. The code
actually allocates the memory but all it needs is the physical address
range. This range is later programmed into hardware as the GART aperture
(the area the GART remaps).

The Linux code can split the aperture if necessary for DMA-API usage and
AGP usage. In that case both users get a half of the aperture and manage
it themselves.

	Joerg
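Two points from this exchange, that the aperture "needs to be aligned on its size anyway" and that it can be split into two halves, can be written down as a short sketch. The struct and function names below are hypothetical, the example addresses are arbitrary, and which user (DMA-API or AGP) gets which half is not stated in the thread, so the halves are simply called lower and upper.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative description of an aperture as a physical address range. */
struct aperture {
	uint64_t base;
	uint64_t size;
};

/* Natural alignment: size is a power of two and base is a multiple of it. */
int aperture_is_naturally_aligned(struct aperture a)
{
	int size_is_pow2 = a.size != 0 && (a.size & (a.size - 1)) == 0;

	return size_is_pow2 && (a.base & (a.size - 1)) == 0;
}

/* Split the aperture into two equally sized halves. */
void aperture_split(struct aperture a,
		    struct aperture *lower, struct aperture *upper)
{
	assert(a.size % 2 == 0);
	lower->base = a.base;
	lower->size = a.size / 2;
	upper->base = a.base + a.size / 2;
	upper->size = a.size / 2;
}
```

A 64 MiB aperture placed at 512 MiB passes the check, while the same aperture shifted up by 32 MiB does not. This is the sense in which using aper_size as the alignment is sufficient once the search start is moved to 512 MiB.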
Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
> On 04/13/2011 12:14 PM, Yinghai Lu wrote:
> >
> > so looks bios program wrong address to the radeon card?
> >
>
> Okay, staring at this, it definitely seems toxic to overlay the GART
> over memory areas reserved by the BIOS. If I were to guess, I would say
> that the problem here seems to be that the kernel thinks it is
> overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
> size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
>
> Alex D., could you comment on the "num cpu pages" bit?

Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):

--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
 			mc->gtt_size = size_bf;
 		}
 		mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
+		if (mc->gtt_start == 0xa000)
+			mc->gtt_start = 0x8000;
 	} else {
 		if (mc->gtt_size > size_af) {
 			dev_warn(rdev->dev, "limiting GTT\n");

And this makes a difference, with this change on-top of -rc3 the box boots
fine. So there seems to be some dependency between the GART base and the GTT
base even when they are in different address spaces.

Alex, can you comment on this?

Regards,

	Joerg
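The context line in the debug patch shows how the driver places the GTT directly below VRAM in the GPU's own address space. A stand-alone sketch of just that expression follows; the function name is made up, gtt_base_align is assumed to be a low-bit mask (alignment minus one) because the code applies it with `& ~`, and the values used in the usage note are chosen for illustration rather than read from the log.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the expression from the quoted context line:
 *   gtt_start = (vram_start & ~gtt_base_align) - gtt_size
 * gtt_base_align is treated as a mask of low bits to clear (assumption). */
uint64_t gtt_start_before_vram(uint64_t vram_start,
			       uint64_t gtt_base_align,
			       uint64_t gtt_size)
{
	uint64_t aligned_vram_start = vram_start & ~gtt_base_align;

	/* The caller must already have checked that the GTT fits below VRAM. */
	assert(gtt_size <= aligned_vram_start);
	return aligned_vram_start - gtt_size;
}
```

For example, a VRAM start of 0xC0000000 with no extra alignment and a 512 MiB GTT gives a GTT start of 0xA0000000. These are addresses as seen by the GPU's memory controller, so a numeric coincidence with a CPU physical address does not by itself imply an overlap, as pointed out elsewhere in the thread.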
Linux 2.6.39-rc3
On Thu, Apr 14, 2011 at 4:56 AM, Joerg Roedel wrote:
> On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote:
>> On 04/13/2011 12:14 PM, Yinghai Lu wrote:
>> >
>> > so looks bios program wrong address to the radeon card?
>> >
>>
>> Okay, staring at this, it definitely seems toxic to overlay the GART
>> over memory areas reserved by the BIOS. If I were to guess, I would say
>> that the problem here seems to be that the kernel thinks it is
>> overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in
>> size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas.
>>
>> Alex D., could you comment on the "num cpu pages" bit?
>
> Okay, I tried the debug-patch from Yinghai (posted to the bugzilla):
>
> --- a/drivers/gpu/drm/radeon/radeon_device.c
> +++ b/drivers/gpu/drm/radeon/radeon_device.c
> @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc)
>  			mc->gtt_size = size_bf;
>  		}
>  		mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
> +		if (mc->gtt_start == 0xa000)
> +			mc->gtt_start = 0x8000;
>  	} else {
>  		if (mc->gtt_size > size_af) {
>  			dev_warn(rdev->dev, "limiting GTT\n");
>
> And this makes a difference, with this change on-top of -rc3 the box boots
> fine. So there seems to be some dependency between the GART base and the GTT
> base even when they are in different address spaces.
>
> Alex, can you comment on this?

As Dave said, they are completely different address spaces. You
could put the GPU aperture at 0 if you wanted (in fact we do on some
chips). Perhaps there's some strange interaction with the nb gart
since the nb gart on that chipset was designed to be used for graphics
and the rs780/880 can be configured to use an agp aperture.
Unfortunately, I'm not that familiar with the nb gart.

Alex

>
> Regards,
>
> 	Joerg
>
Linux 2.6.39-rc3
On Wed, 13 Apr 2011 19:33:40 -0700 Linus Torvalds wrote:

> On Wednesday, April 13, 2011, Linus Torvalds wrote:
> > On Wednesday, April 13, 2011, H. Peter Anvin wrote:
> >>
> >> Yes. However, even if we *do* revert (and the time is running short on
> >> not reverting) I would like to understand this particular one, simply
> >> because I think it may very well be a problem that is manifesting itself
> >> in other ways on other systems.
>
> sorry, fingerfart. Anyway, I agree 100%.
>
> we definitely want to also understand the reason for things not
> working, even if we do revert..

Definitely, because if it fails when the "magic" involves the GART base
it starts to sound like something may be hitting the wrong address space
or not flushing properly.
Re: Linux 2.6.39-rc3
On 04/14/2011 02:11 AM, Ingo Molnar wrote:
>
> I'd strongly suggest we revert back to the old and proven allocation
> order, as long as it results in valid layouts. Even if we figure out this
> particular GART/GTT assumption there might be a dozen others in other
> types of hardware.
>

Yes, but we might also be hiding a real bug which bites other hardware.

We have found real and very serious bugs in the kernel this way before --
things where drivers scribble over random memory and allocation order
exposed the failure in a predictable way, as opposed to random crashes.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
Linux 2.6.39-rc3
On 04/14/2011 02:11 AM, Ingo Molnar wrote: > > I'd strongly suggest we revert back to the old and proven allocation order, > as > long as it results in valid layouts. Even if we figure out this particular > GART/GTT assumption there might be a dozen others in other types of hardware. > Yes, but we might also be hiding a real bug which bites other hardware. We have found real and very serious bugs in the kernel this way before -- things where drivers scribble over random memory and allocation order exposed the failure in a predictable way, as opposed to random crashes. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: Linux 2.6.39-rc3
On Thu, Apr 14, 2011 at 01:03:37PM +0900, Tejun Heo wrote: > Hello, > > On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote: > > On Wednesday, April 13, 2011, Linus Torvalds > > wrote: > > > On Wednesday, April 13, 2011, H. Peter Anvin wrote: > > >> > > >> Yes. However, even if we *do* revert (and the time is running short on > > >> not reverting) I would like to understand this particular one, simply > > >> because I think it may very well be a problem that is manifesting itself > > >> in other ways on other systems. > > > > sorry, fingerfart. Anyway, I agree 100%. > > > > we definitely want to also understand the reason for things not > > working, even if we do revert.. > > There were (and still are) places where memblock callers implemented > ad-hoc top-down allocation by stepping down start limit until > allocation succeeds. Several of them have been removed since top-down > became the default behavior, so simply reverting the commit is likely > to cause subtle issues. Maybe the best approach is introducing > @topdown parameter and use it selectively for pure memory allocations. Wouldn't it be better to provide a separate memblock allocation function which operates top-down and use this one in the places that need it? This way it wouldn't break code that relies on bottom-up. Joerg
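To make the two proposals in this exchange concrete, here is a minimal sketch of bottom-up and top-down search written as two separate functions over a sorted list of free ranges. It is a toy model and not memblock code: `struct free_range`, `find_bottom_up()`, `find_top_down()` and `SKETCH_FAIL` are hypothetical names, and reserved regions, NUMA nodes and caller limits are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FAIL UINT64_MAX /* hypothetical "no fit" marker */

/* One free range [base, end); the array passed in is sorted ascending. */
struct free_range {
    uint64_t base;
    uint64_t end;
};

static uint64_t align_up(uint64_t x, uint64_t a)
{
    return (x + a - 1) & ~(a - 1);
}

static uint64_t align_down(uint64_t x, uint64_t a)
{
    return x & ~(a - 1);
}

/* Return the lowest aligned address that fits, or SKETCH_FAIL. */
static uint64_t find_bottom_up(const struct free_range *r, size_t n,
                               uint64_t size, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    for (size_t i = 0; i < n; i++) {
        uint64_t addr = align_up(r[i].base, align);
        /* addr >= base rejects a wrapped align_up result */
        if (addr >= r[i].base && addr <= r[i].end && r[i].end - addr >= size)
            return addr;
    }
    return SKETCH_FAIL;
}

/* Return the highest aligned address that fits, or SKETCH_FAIL. */
static uint64_t find_top_down(const struct free_range *r, size_t n,
                              uint64_t size, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    for (size_t i = n; i-- > 0; ) {
        if (r[i].end - r[i].base < size)
            continue;
        uint64_t addr = align_down(r[i].end - size, align);
        if (addr >= r[i].base)
            return addr;
    }
    return SKETCH_FAIL;
}
```

With two made-up free ranges, [0x00100000, 0x80000000) and [0x90000000, 0xB0000000), and a 64 MiB request aligned to 64 MiB, the bottom-up variant returns the lowest fit and the top-down variant the highest. A caller that silently depended on one of the two placements would see its allocation move when the default direction changes, which is the kind of effect being discussed here.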
Re: Linux 2.6.39-rc3
* Joerg Roedel wrote: > On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote: > > On 04/13/2011 12:14 PM, Yinghai Lu wrote: > > > > > > so looks bios program wrong address to the radon card? > > > > > > > Okay, staring at this, it definitely seems toxic to overlay the GART > > over memory areas reserved by the BIOS. If I were to guess, I would say > > that the problem here seems to be that the kernel thinks it is > > overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in > > size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas. > > > > Alex D., could you comment on the "num cpu pages" bit? > > Okay, I tried the debug-patch from Yinghai (posted to the bugzilla): > > --- a/drivers/gpu/drm/radeon/radeon_device.c > +++ b/drivers/gpu/drm/radeon/radeon_device.c > @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, > struct radeon_mc *mc) > mc->gtt_size = size_bf; > } > mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - > mc->gtt_size; > + if (mc->gtt_start == 0xa000) > + mc->gtt_start = 0x8000; > } else { > if (mc->gtt_size > size_af) { > dev_warn(rdev->dev, "limiting GTT\n"); > > And this makes a difference, with this change on-top of -rc3 the box boots > fine. So there seems to be some dependency between the GART base and the GTT > base even when they are in different address spaces. > > Alex, can you comment on this? I'd strongly suggest we revert back to the old and proven allocation order, as long as it results in valid layouts. Even if we figure out this particular GART/GTT assumption there might be a dozen others in other types of hardware. Thanks, Ingo
Re: Linux 2.6.39-rc3
On Thu, Apr 14, 2011 at 6:56 PM, Joerg Roedel wrote: > On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote: >> On 04/13/2011 12:14 PM, Yinghai Lu wrote: >> > >> > so looks bios program wrong address to the radon card? >> > >> >> Okay, staring at this, it definitely seems toxic to overlay the GART >> over memory areas reserved by the BIOS. If I were to guess, I would say >> that the problem here seems to be that the kernel thinks it is >> overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in >> size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas. >> >> Alex D., could you comment on the "num cpu pages" bit? > > Okay, I tried the debug-patch from Yinghai (posted to the bugzilla): > > --- a/drivers/gpu/drm/radeon/radeon_device.c > +++ b/drivers/gpu/drm/radeon/radeon_device.c > @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, > struct radeon_mc *mc) > mc->gtt_size = size_bf; > } > mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - > mc->gtt_size; > + if (mc->gtt_start == 0xa000) > + mc->gtt_start = 0x8000; > } else { > if (mc->gtt_size > size_af) { > dev_warn(rdev->dev, "limiting GTT\n"); > > And this makes a difference, with this change on-top of -rc3 the box boots > fine. So there seems to be some dependency between the GART base and the GTT > base even when they are in different address spaces. > > Alex, can you comment on this? Weird. Either it's a hw bug, or some access to the GTT is leaking out before things are set up properly. I think the RS780/880 docs are on the website, but generally the address spaces are completely separate, so anything getting through is very unusual. Dave.
Re: Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 03:31:09PM -0700, H. Peter Anvin wrote: > On 04/13/2011 03:22 PM, Joerg Roedel wrote: > > On Wed, Apr 13, 2011 at 03:01:10PM -0700, H. Peter Anvin wrote: > >> On 04/13/2011 02:50 PM, Joerg Roedel wrote: > >>> On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote: > -addr = memblock_find_in_range(0, 1ULL<<32, aper_size, > 512ULL<<20); > +addr = memblock_find_in_range(0, 1ULL<<32, aper_size, > 512ULL<<21); > >>> > >>> Btw, while looking at this code I wondered why the 512M goal is enforced > >>> by the alignment. Start could be set to 512M instead and the alignment > >>> can be aper_size as it should. Any reason for such a big alignment? > >>> > >>> Joerg > >>> > >>> P.S.: The box is still in the office, I will try this debug-patch > >>> tomorrow. > >> > >> The only reason that I can think of is that the aperture itself can be > >> huge, and perhaps 512 MiB is the biggest such known. > > > > Well, that would work as well by just using aper_size as alignment, the > > aperture needs to be aligned on its size anyway. This code only runs > > when Linux allocates the aperture itself and if I am mistaken is uses > > always 64MB when doing this. > > Yes, I would agree with that. The sane thing would be to set the base > to whatever address needs to be guarded against (WHICH SHOULD BE > MOTIVATED), and use aper_size as alignment, *unless* we are only using > the initial portion of a much larger hardware structure that needs > natural alignment (which isn't clear to me, I do know we sometimes use > only a fraction of the GART, but that doesn't mean we need to > naturally-align the entire thing, nor that 512 MiB is sufficient to do so.) What's allocated here is the address space for the aperture. The code actually allocates the memory, but all it needs is the physical address range. This range is later programmed into hardware as the GART aperture (the area the GART remaps).
The Linux code can split the aperture if necessary for DMA-API usage and AGP usage. In that case each user gets half of the aperture and manages it itself. Joerg
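The two properties mentioned here, natural alignment of the aperture on its own size and the even split between two users, can be sketched as follows. This is only an illustration under stated assumptions: `struct aperture`, `is_naturally_aligned()` and `split_in_half()` are hypothetical names, and the example addresses are made up. Which half goes to the DMA-API and which to AGP is not stated in the thread, so the sketch only calls them "lower" and "upper".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of an aperture as a physical address range. */
struct aperture {
    uint64_t base; /* first physical address covered */
    uint64_t size; /* length in bytes, assumed to be a power of two */
};

/* An aperture is naturally aligned when its base is a multiple of its size. */
static bool is_naturally_aligned(struct aperture a)
{
    assert(a.size != 0 && (a.size & (a.size - 1)) == 0); /* power of two */
    return (a.base & (a.size - 1)) == 0;
}

/* Split an aperture into two equally sized halves for two independent users. */
static void split_in_half(struct aperture whole,
                          struct aperture *lower, struct aperture *upper)
{
    uint64_t half = whole.size / 2;

    lower->base = whole.base;
    lower->size = half;
    upper->base = whole.base + half;
    upper->size = half;
}
```

For a 64 MiB aperture at a 64 MiB-aligned base, both halves come out as 32 MiB ranges that are themselves naturally aligned, so each user can treat its half as a smaller aperture of its own.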
Re: Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 06:58:46PM -0700, H. Peter Anvin wrote: > On 04/13/2011 12:14 PM, Yinghai Lu wrote: > > > > so looks bios program wrong address to the radon card? > > > > Okay, staring at this, it definitely seems toxic to overlay the GART > over memory areas reserved by the BIOS. If I were to guess, I would say > that the problem here seems to be that the kernel thinks it is > overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in > size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas. > > Alex D., could you comment on the "num cpu pages" bit? Okay, I tried the debug-patch from Yinghai (posted to the bugzilla): --- a/drivers/gpu/drm/radeon/radeon_device.c +++ b/drivers/gpu/drm/radeon/radeon_device.c @@ -325,6 +325,8 @@ void radeon_gtt_location(struct radeon_device *rdev, struct radeon_mc *mc) mc->gtt_size = size_bf; } mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size; + if (mc->gtt_start == 0xa000) + mc->gtt_start = 0x8000; } else { if (mc->gtt_size > size_af) { dev_warn(rdev->dev, "limiting GTT\n"); And this makes a difference, with this change on-top of -rc3 the box boots fine. So there seems to be some dependency between the GART base and the GTT base even when they are in different address spaces. Alex, can you comment on this? Regards, Joerg
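The placement arithmetic in the hunk above can be worked through in isolation. The sketch below is not the radeon driver: `struct mc_sketch` and `place_gtt_before_vram()` are hypothetical stand-ins. It assumes that `gtt_base_align` is a mask (alignment minus one), that VRAM starts at GPU address 0xC0000000 with a 512 MiB GTT, and that the short constants in this archive (0xa000, 0x8000) are truncated forms of 0xA0000000 and 0x80000000.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the few fields the quoted hunk touches. */
struct mc_sketch {
    uint64_t vram_start;     /* GPU address where VRAM begins */
    uint64_t gtt_size;       /* size of the GTT in bytes */
    uint64_t gtt_base_align; /* assumed to be a mask: alignment - 1 */
    uint64_t gtt_start;      /* result: GPU address where the GTT begins */
};

/* Place the GTT directly below VRAM: round the VRAM start down to the
 * GTT alignment, then subtract the GTT size, as in the quoted line. */
static void place_gtt_before_vram(struct mc_sketch *mc)
{
    /* a valid mask has the form 2^n - 1 */
    assert((mc->gtt_base_align & (mc->gtt_base_align + 1)) == 0);
    mc->gtt_start = (mc->vram_start & ~mc->gtt_base_align) - mc->gtt_size;
}
```

Under those assumptions the GTT lands at 0xA0000000 and ends at 0xBFFFFFFF in the GPU's address space. That is numerically the same value as the CPU physical address that would be reported for the relocated GART aperture if the truncation assumption holds, which would explain why the debug patch tests for exactly this value, even though the two numbers live in different address spaces.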
Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 03:01:10PM -0700, H. Peter Anvin wrote: > On 04/13/2011 02:50 PM, Joerg Roedel wrote: > > On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote: > >> - addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20); > >> + addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21); > > > > Btw, while looking at this code I wondered why the 512M goal is enforced > > by the alignment. Start could be set to 512M instead and the alignment > > can be aper_size as it should. Any reason for such a big alignment? > > > > Joerg > > > > P.S.: The box is still in the office, I will try this debug-patch > > tomorrow. > > The only reason that I can think of is that the aperture itself can be > huge, and perhaps 512 MiB is the biggest such known. Well, that would work as well by just using aper_size as alignment, the aperture needs to be aligned on its size anyway. This code only runs when Linux allocates the aperture itself and, if I am not mistaken, it always uses 64MB when doing this. Joerg
Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 01:48:48PM -0700, Yinghai Lu wrote: > - addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<20); > + addr = memblock_find_in_range(0, 1ULL<<32, aper_size, 512ULL<<21); Btw, while looking at this code I wondered why the 512M goal is enforced by the alignment. Start could be set to 512M instead and the alignment can be aper_size as it should. Any reason for such a big alignment? Joerg P.S.: The box is still in the office, I will try this debug-patch tomorrow.
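The difference between the two ways of expressing the "512M goal" can be shown with a toy search. The sketch below is not memblock: `find_highest_aligned()` and `SKETCH_FAIL` are hypothetical names, the whole window [start, end) is treated as free, and the search simply returns the highest aligned address that leaves room for the request. The 0xB0000000 upper bound used in the note afterwards is a made-up figure for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FAIL UINT64_MAX /* hypothetical "no fit" marker */

/* Toy model: treat [start, end) as entirely free and return the highest
 * address that is a multiple of align and leaves room for size bytes. */
static uint64_t find_highest_aligned(uint64_t start, uint64_t end,
                                     uint64_t size, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    if (size == 0 || end < size || end - size < start)
        return SKETCH_FAIL;

    uint64_t addr = (end - size) & ~(align - 1); /* round down to align */

    return addr >= start ? addr : SKETCH_FAIL;
}
```

With a 64 MiB request and a window ending at 0xB0000000, passing start 0 and a 512 MiB alignment yields 0xA0000000, while passing start 512 MiB and an alignment equal to the request size yields 0xAC000000. The large alignment leaves only a handful of candidate addresses, and it does not by itself exclude address 0, whereas an explicit start bound states the intent directly.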
Linux 2.6.39-rc3
On 04/13/2011 07:07 PM, Dave Airlie wrote: >> >> Okay, staring at this, it definitely seems toxic to overlay the GART >> over memory areas reserved by the BIOS. If I were to guess, I would say >> that the problem here seems to be that the kernel thinks it is >> overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in >> size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas. >> >> Alex D., could you comment on the "num cpu pages" bit? > > These are not CPU addresses. I think we've stated that already. Not the > droids. > > the num cpu pages is how many CPU pages would be needed to fill the GPU > GTT, for those crazy cases where CPU pagesize != GPU pagesize. > OK, well, something is still weird. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 12:14:55PM -0700, Yinghai Lu wrote: > thanks for the bisecting... > > so those two patches uncover some problems. > > [0.00] Checking aperture... > [0.00] No AGP bridge found > [0.00] Node 0: aperture @ a000 size 32 MB > [0.00] Aperture pointing to e820 RAM. Ignoring. > [0.00] Your BIOS doesn't leave a aperture memory hole > [0.00] Please enable the IOMMU option in the BIOS setup > [0.00] This costs you 64 MB of RAM > [0.00] memblock_x86_reserve_range: [0xa000-0xa3ff] > aperture64 > [0.00] Mapping aperture over 65536 KB of RAM @ a000 > > so kernel try to reallocate apperture. because BIOS allocated is pointed to > RAM or size is too small. It is actually beyond 4GB on that machine, this value read here is from the previous kernel-boot. The BIOS does not reset these values on a reboot. > but your radeon does use [0xa000, 0xbfff) Yes, I suspected that too (and spent a few hours reading radeon code), but then I talked to Alex Deucher and he explained that these addresses which the driver prints for GTT and VRAM are in the GPU address space and do not refer to system RAM. So this shouldn't be the problem. Joerg
Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 11:39:29AM -0700, H. Peter Anvin wrote: > On 04/13/2011 10:21 AM, Joerg Roedel wrote: > > On Wed, Apr 13, 2011 at 08:46:09AM +0200, Ingo Molnar wrote: > >> Could you please send the before/after bootlog (in particular all memory > >> init > >> messages included) and your .config? > >> > >> before: f005fe12b90c: x86-64: Move out cleanup higmap [_brk_end, _end) > >> out of init_memory_mapping() > >> after: d2137d5af425: Merge branch 'linus' into x86/bootmem > >> > >> I've Cc:-ed more people who might have an idea about it. > > > > Okay, I have done some more bisecting and debugging today. > > > > First of all, *huge* thanks for this effort. At least we need to track > down the bits that need to be reverted -- it is past rc3, and it's time > to see what we should revert and tell the submitter to try again next cycle. > > This looks to be the same issue as in bugzilla 33012: > > https://bugzilla.kernel.org/show_bug.cgi?id=33012 > > ... so it would be good if we could keep the information in there. Yes, I will try to find my korg bugzilla account again and put the information from this email there. Joerg
Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 11:51:39AM -0700, H. Peter Anvin wrote: > On 04/13/2011 10:21 AM, Joerg Roedel wrote: > > > > First of all, I bisected between v2.6.37-rc2..f005fe12b90c which where > > only a couple of patches and merged v2.6.38-rc4 in at every step. There > > was no failure found. > > Then I tried this again, but this time I merged v2.6.38-rc5 at every > > step and was successful. The bad commit in this branch turned out to be > > > > 1a4a678b12c84db9ae5dce424e0e97f0559bb57c > > > > which is related to memblock. > > > > Then I tried to find out which change between 2.6.38-rc4 and 2.6.38-rc5 > > is needed to trigger the failure, so I used f005fe12b90c as a base, > > bisected between v2.6.38-rc4..v2.6.38-rc5 and merged every bisect step > > into the base and tested. Here the bad commit turned out to be > > > > e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20 > > > > which is related to gart. It turned out that the gart aperture on that > > box is on another position with these patches. Before it was as > > 0xa400 and now it is at 0xa000. It seems like this has something > > to do with the root-cause. > > > > Reverting commit 1a4a678b12c84db9ae5dce424e0e97f0559bb57c fixes the > > problem btw. and booting with iommu=soft also works, but I have no idea > > yet why the aperture at that address is a problem (with the patch > > reverted the aperture lands at 0x8000). > > > > Does reverting e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20 solve the > problem for you? No, reverting that patch doesn't make the problem go away (and the gart aperture is still at 0xa000). I tested this in 39-rc3; I haven't tested whether it makes a difference on the original bisect-commit from Ingo. Probably it does (I don't know if that matters). What is strange about this commit is that it fixes an x86 gart aperture allocation bug in generic memblock code. > 1a4a678b12c84db9ae5dce424e0e97f0559bb57c is a memory-allocation-order > patch, which have a nasty tendency to unmask bugs elsewhere in the > kernel. 
> However, e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20 looks > positively strange (and it doesn't exactly help that the description is > written in Yinghai-ese and is therefore nearly impossible to decode, > never mind tell if it is remotely correct.) I think that the two commits are okay and the bug is somewhere else, but I have no idea yet where to look next. I spent some time looking at radeon code and talking to Alex about it (because it seemed suspicious that the GTT is at 0xa000 too, but as Alex explained to me, this is an address in the GPU address space and shouldn't matter). Regards, Joerg
Re: Linux 2.6.39-rc3
Hello, On Wed, Apr 13, 2011 at 07:33:40PM -0700, Linus Torvalds wrote: > On Wednesday, April 13, 2011, Linus Torvalds > wrote: > > On Wednesday, April 13, 2011, H. Peter Anvin wrote: > >> > >> Yes. However, even if we *do* revert (and the time is running short on > >> not reverting) I would like to understand this particular one, simply > >> because I think it may very well be a problem that is manifesting itself > >> in other ways on other systems. > > sorry, fingerfart. Anyway, I agree 100%. > > we definitely want to also understand the reason for things not > working, even if we do revert.. There were (and still are) places where memblock callers implemented ad-hoc top-down allocation by stepping down the start limit until allocation succeeds. Several of them have been removed since top-down became the default behavior, so simply reverting the commit is likely to cause subtle issues. Maybe the best approach is introducing a @topdown parameter and using it selectively for pure memory allocations. Thanks. -- tejun
Re: Linux 2.6.39-rc3
On Wednesday, April 13, 2011, Linus Torvalds wrote: > On Wednesday, April 13, 2011, H. Peter Anvin wrote: >> >> Yes. However, even if we *do* revert (and the time is running short on >> not reverting) I would like to understand this particular one, simply >> because I think it may very well be a problem that is manifesting itself >> in other ways on other systems. sorry, fingerfart. Anyway, I agree 100%. we definitely want to also understand the reason for things not working, even if we do revert.. Linus >> of complete b*llsh*t magic numbers in this >
Re: Linux 2.6.39-rc3
On Wednesday, April 13, 2011, H. Peter Anvin wrote: > > Yes. However, even if we *do* revert (and the time is running short on > not reverting) I would like to understand this particular one, simply > because I think it may very well be a problem that is manifesting itself > in other ways on other systems. > > The other thing that this has uncovered is that we already have a bunch > of complete b*llsh*t magic numbers in this
Linux 2.6.39-rc3
On Wed, Apr 13, 2011 at 08:46:09AM +0200, Ingo Molnar wrote: > Could you please send the before/after bootlog (in particular all memory init > messages included) and your .config? > > before: f005fe12b90c: x86-64: Move out cleanup higmap [_brk_end, _end) out > of init_memory_mapping() > after: d2137d5af425: Merge branch 'linus' into x86/bootmem > > I've Cc:-ed more people who might have an idea about it. Okay, I have done some more bisecting and debugging today. First of all, I bisected between v2.6.37-rc2..f005fe12b90c, which were only a couple of patches, and merged v2.6.38-rc4 in at every step. There was no failure found. Then I tried this again, but this time I merged v2.6.38-rc5 at every step and was successful. The bad commit in this branch turned out to be 1a4a678b12c84db9ae5dce424e0e97f0559bb57c which is related to memblock. Then I tried to find out which change between 2.6.38-rc4 and 2.6.38-rc5 is needed to trigger the failure, so I used f005fe12b90c as a base, bisected between v2.6.38-rc4..v2.6.38-rc5 and merged every bisect step into the base and tested. Here the bad commit turned out to be e6d2e2b2b1e1455df16d68a78f4a3874c7b3ad20 which is related to gart. It turned out that the gart aperture on that box is at another position with these patches. Before it was at 0xa400 and now it is at 0xa000. It seems like this has something to do with the root cause. Reverting commit 1a4a678b12c84db9ae5dce424e0e97f0559bb57c fixes the problem, btw., and booting with iommu=soft also works, but I have no idea yet why the aperture at that address is a problem (with the patch reverted the aperture lands at 0x8000). I have put some debug data online. There is my .config and two dmesg files for good (==2.6.39-rc3 + revert) and bad (==2.6.39-rc3). I also created these dmesg files again with memblock=debug, maybe that helps to find the problem. The files are at http://www.8bytes.org/~joro/debug/ Or maybe someone else has an idea about the issue... Joerg
Re: Linux 2.6.39-rc3
On Wed, 2011-04-13 at 18:58 -0700, H. Peter Anvin wrote: > On 04/13/2011 12:14 PM, Yinghai Lu wrote: > > > > so those two patches uncover some problems. > > > > [0.00] Checking aperture... > > [0.00] No AGP bridge found > > [0.00] Node 0: aperture @ a000 size 32 MB > > [0.00] Aperture pointing to e820 RAM. Ignoring. > > [0.00] Your BIOS doesn't leave a aperture memory hole > > [0.00] Please enable the IOMMU option in the BIOS setup > > [0.00] This costs you 64 MB of RAM > > [0.00] memblock_x86_reserve_range: [0xa000-0xa3ff] > > aperture64 > > [0.00] Mapping aperture over 65536 KB of RAM @ a000 > > > > so kernel try to reallocate apperture. because BIOS allocated is pointed to > > RAM or size is too small. > > > > but your radeon does use [0xa000, 0xbfff) > > > > [4.281993] radeon :01:05.0: VRAM: 320M 0xC000 - > > 0xD3FF (320M used) > > [4.290672] radeon :01:05.0: GTT: 512M 0xA000 - > > 0xBFFF > > [4.298550] [drm] Detected VRAM RAM=320M, BAR=256M > > [4.309857] [drm] RAM width 32bits DDR > > [4.313748] [TTM] Zone kernel: Available graphics memory: 1896524 kiB. > > [4.320379] [TTM] Initializing pool allocator. > > [4.324948] [drm] radeon: 320M of VRAM memory ready > > [4.329832] [drm] radeon: 512M of GTT memory ready. > > > > and the one seems working: > > > > [0.00] Checking aperture... > > [0.00] No AGP bridge found > > [0.00] Node 0: aperture @ a000 size 32 MB > > [0.00] Aperture pointing to e820 RAM. Ignoring. > > [0.00] Your BIOS doesn't leave a aperture memory hole > > [0.00] Please enable the IOMMU option in the BIOS setup > > [0.00] This costs you 64 MB of RAM > > [0.00] memblock_x86_reserve_range: [0x8000-0x83ff] > > aperture64 > > [0.00] Mapping aperture over 65536 KB of RAM @ 8000 > > [0.00] memblock_x86_reserve_range: [0xacb6bdc0-0xacb6bddf] > > BOOTMEM > > > > will use different position... 
> > > > [4.250159] radeon :01:05.0: VRAM: 320M 0xC000 - > > 0xD3FF (320M used) > > [4.258830] radeon :01:05.0: GTT: 512M 0xA000 - > > 0xBFFF > > [4.266742] [drm] Detected VRAM RAM=320M, BAR=256M > > [4.271549] [drm] RAM width 32bits DDR > > [4.275435] [TTM] Zone kernel: Available graphics memory: 1896526 kiB. > > [4.282066] [TTM] Initializing pool allocator. > > [4.282085] usb 7-2: new full speed USB device number 2 using ohci_hcd > > [4.293076] [drm] radeon: 320M of VRAM memory ready > > [4.298277] [drm] radeon: 512M of GTT memory ready. > > [4.303218] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010). > > [4.309854] [drm] Driver supports precise vblank timestamp query. > > [4.315970] [drm] radeon: irq initialized. > > [4.320094] [drm] GART: num cpu pages 131072, num gpu pages 131072 > > > > So question is why radeon is using the address [0xa000 - 0xc00], > > and in E820 it is RAM > > > > [0.00] BIOS-e820: 0010 - acb8d000 (usable) > > [0.00] BIOS-e820: acb8d000 - acb8f000 (reserved) > > [0.00] BIOS-e820: acb8f000 - afce9000 (usable) > > [0.00] BIOS-e820: afce9000 - afd21000 (reserved) > > [0.00] BIOS-e820: afd21000 - afd4f000 (usable) > > [0.00] BIOS-e820: afd4f000 - afdcf000 (reserved) > > [0.00] BIOS-e820: afdcf000 - afecf000 (ACPI NVS) > > [0.00] BIOS-e820: afecf000 - afeff000 (ACPI data) > > [0.00] BIOS-e820: afeff000 - aff0 (usable) > > > > so looks bios program wrong address to the radon card? > > > > Okay, staring at this, it definitely seems toxic to overlay the GART > over memory areas reserved by the BIOS. If I were to guess, I would say > that the problem here seems to be that the kernel thinks it is > overlaying 64 MiB of memory, but the actual GART is in fact 512 MiB in > size -- 131072 CPU pages -- which now overlaps the BIOS reserved areas. > > Alex D., could you comment on the "num cpu pages" bit? These are not CPU addresses. I think we've stated that already. Not the droids. 
The num cpu pages is how many CPU pages would be needed to fill the GPU GTT, for those crazy cases where CPU pagesize != GPU pagesize. Dave.
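The page counts in the quoted log line follow from simple division. Here is a minimal sketch, assuming a 512 MiB GTT and 4 KiB pages on both the CPU and the GPU side; the page sizes are an assumption chosen because they reproduce the logged 131072 / 131072, and `pages_to_cover()` is a hypothetical helper, not a driver function.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages of the given size needed to cover `bytes`, rounding up. */
static uint64_t pages_to_cover(uint64_t bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return (bytes + page_size - 1) / page_size;
}
```

With those assumptions, `pages_to_cover(512ULL << 20, 4096)` gives 131072 for both counts. On a hypothetical system with 64 KiB CPU pages and 4 KiB GPU pages, the same GTT would need 8192 CPU pages but still 131072 GPU pages. In both cases the count describes the size of the GTT, not a range of CPU physical addresses.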
Linux 2.6.39-rc3
On 04/13/2011 04:39 PM, Linus Torvalds wrote: > > - Choice #2: understand exactly _what_ goes wrong, and fix it > analytically (ie by _understanding_ the problem, and being able to > solve it exactly, and in a way you can argue about without having to > resort to "magic happens"). > > Now, the whole analytic approach (aka "computer sciency" approach), > where you can actually think about the problem without having any > pesky "reality" impact the solution is obviously the one we tend to > prefer. Sadly, it's seldom the one we can use in reality when it comes > to things like resource allocation, since we end up starting off with > often buggy approximations of what the actual hardware is all about > (ie broken firmware tables). > > So I'd love to know exactly why one random number works, and why > another one doesn't. But as long as we do _not_ know the "Why" of it, > we will have to revert. > Yes. However, even if we *do* revert (and the time is running short on not reverting) I would like to understand this particular one, simply because I think it may very well be a problem that is manifesting itself in other ways on other systems. The other thing that this has uncovered is that we already have a bunch of complete b*llsh*t magic numbers in this path, some of which are trivially shown to be wrong or at least completely arbitrary, so there are more issues here :( -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.