Re: Nvidia woes...

2010-04-28 Thread Phil Perry

Alan Bartlett wrote:
> On 28 April 2010 14:10, Alan Bartlett wrote:
>> On 27 April 2010 21:54, Mark Stodola wrote:
>>> Hey everyone,
>>>
>>> I currently have deployed a number of SL 5.2 i386 machines.
>
> Mark,
>
> Further to my earlier message, I have re-read your initial sentence above.
>
> A SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5
> series. Unfortunately the kmod-nvidia[-*] packages that are available
> from ELRepo, although being kABI tracking, will only weak-link back to
> the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many
> of the other packages, back to the original 2.6.18-8.x.y.el5 kernels.
>
> So having raised your hopes, I now have to dash them. Sorry.
>
> Regards,
> Alan.


Correct for the current driver (kmod-nvidia) and the 173 series
(kmod-nvidia-173xx) legacy driver, but the older kmod-nvidia-96xx legacy
driver is currently kABI compliant with all current EL5 kernels :)


Re: Nvidia woes...

2010-04-28 Thread Alan Bartlett
On 28 April 2010 15:41, Jaroslaw Polok  wrote:

> We do use some NVIDIA cards, but mostly with the 96xx legacy series
> driver (although we may start using the current 195 series
> on our future hardware).
>
> So far we have used nvidia packages that we maintain ourselves, but it would
> be interesting for us to change this situation...

Hello Jarek,

As long as you are using SL 5.3 or above, you should find that the
packages will fulfil your requirements. The current (as distinct from
the legacy) package is regularly rebuilt when nVidia releases a new
version.

> Speaking of which, I'm a little bit confused about nvidia kernel
> modules and kABI: checking http://dup.et.redhat.com/
> and using the kABI testing script gives the following result on your
>
> kmod-nvidia-190.53-1.el5.elrepo.x86_64.rpm, nvidia.ko:
>
> ./abi_check.py ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
> Red Hat Enterprise Linux 5 ABI Checker
> --
>
> ABI Checker version: 1.2
>
> Module:    ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
> Kernel:    2.6.18-194.el5
> Whitelist: /usr/src/kernels/2.6.18-194.el5-x86_64/kabi_whitelist
>
> WARNING: The following symbols are used by your module
> WARNING: and are not on the ABI whitelist.
>
> symbol: acpi_walk_namespace
> symbol: agp_bridges
> symbol: acpi_get_handle
> symbol: acpi_os_wait_events_complete
> symbol: acpi_evaluate_object
> symbol: acpi_bus_get_device
> symbol: acpi_install_notify_handler
> symbol: acpi_evaluate_integer
> symbol: acpi_remove_notify_handler

That is a known issue and, as I am not one of the team maintaining the
nVidia packages, it would be best if I do not go into great detail
but just refer you to the relevant bug tracker entries [1][2]. The
whole issue of the kernel ABI whitelist, and of certain kmod packages
requiring non-listed symbols, is something that
we have discussed with Jon Masters, of Red Hat.

Although I know that other members of the ELRepo Admin Team are
subscribed to this list, I think it might be best to transfer this
discussion to the ELRepo mailing list [3] where a more appropriate
audience can be found.

Regards,
Alan.

[1] http://elrepo.org/bugs/view.php?id=30
[2] https://bugzilla.redhat.com/show_bug.cgi?id=520891
[3] http://lists.elrepo.org/mailman/listinfo/elrepo


Re: Nvidia woes...

2010-04-28 Thread Steven Timm

On Wed, 28 Apr 2010, Alan Bartlett wrote:

> On 28 April 2010 14:10, Alan Bartlett wrote:
>> On 27 April 2010 21:54, Mark Stodola wrote:
>>> Hey everyone,
>>>
>>> I currently have deployed a number of SL 5.2 i386 machines.
>
> Mark,
>
> Further to my earlier message, I have re-read your initial sentence above.
>
> A SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5
> series.

It is possible to update any SL5.x system to the latest kernel
that is available, so you should be able to get around that.
For example, I run kernels from the 5.4 series on 5.3 systems all the
time, and did the same with 5.3 kernels on 5.2 systems.

Steve

> Unfortunately the kmod-nvidia[-*] packages that are available
> from ELRepo, although being kABI tracking, will only weak-link back to
> the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many
> of the other packages, back to the original 2.6.18-8.x.y.el5 kernels.
>
> So having raised your hopes, I now have to dash them. Sorry.
>
> Regards,
> Alan.


--
--
Steven C. Timm, Ph.D  (630) 840-8525
t...@fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.


Re: Nvidia woes...

2010-04-28 Thread Alan Bartlett
On 28 April 2010 14:10, Alan Bartlett  wrote:
> On 27 April 2010 21:54, Mark Stodola  wrote:
>> Hey everyone,
>>
>> I currently have deployed a number of SL 5.2 i386 machines.

Mark,

Further to my earlier message, I have re-read your initial sentence above.

A SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5
series. Unfortunately the kmod-nvidia[-*] packages that are available
from ELRepo, although being kABI tracking, will only weak-link back to
the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many
of the other packages, back to the original 2.6.18-8.x.y.el5 kernels.

So having raised your hopes, I now have to dash them. Sorry.

Regards,
Alan.
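
A minimal sketch of the kernel-series constraint described above, for anyone who wants to check a machine before trying the packages. It assumes the only input is the kernel release string (as printed by `uname -r`) and takes the 2.6.18-128 cut-off directly from this message; the function names are hypothetical and nothing here is part of the ELRepo packages themselves.

```python
import re

# Assumption from the message above: kmod-nvidia and kmod-nvidia-173xx
# weak-link back only as far as the 2.6.18-128 (SL 5.3) kernel series.
MIN_BUILD = 128


def el5_build(release):
    """Return the build number from a release such as '2.6.18-92.1.6.el5'."""
    match = re.match(r"2\.6\.18-(\d+)\.", release)
    if match is None:
        raise ValueError("not an EL5 kernel release: %r" % release)
    return int(match.group(1))


def can_weak_link(release, min_build=MIN_BUILD):
    """True if the kernel is new enough for the module to weak-link to it."""
    return el5_build(release) >= min_build


if __name__ == "__main__":
    for rel in ("2.6.18-92.1.6.el5", "2.6.18-128.el5", "2.6.18-194.el5"):
        print(rel, can_weak_link(rel))
```

Passing a different `min_build` would cover packages with another cut-off, but any such value would have to come from the package maintainers rather than from this sketch.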



Re: Nvidia woes...

2010-04-28 Thread Jaroslaw Polok
Alan Bartlett wrote:

[...]
> I don't use nVidia graphics cards and also should mention my
> connection to the ELRepo Project "up front" but have you tried using
> the kernel independent, kABI tracking kmod packages that the ELRepo
> Project provides? [1]

We do use some NVIDIA cards, but mostly with the 96xx legacy series
driver (although we may start using the current 195 series
on our future hardware).

So far we have used nvidia packages that we maintain ourselves, but it would
be interesting for us to change this situation...

Speaking of which, I'm a little bit confused about nvidia kernel
modules and kABI: checking http://dup.et.redhat.com/
and using the kABI testing script gives the following result on your

kmod-nvidia-190.53-1.el5.elrepo.x86_64.rpm, nvidia.ko:

./abi_check.py ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
Red Hat Enterprise Linux 5 ABI Checker
--

ABI Checker version: 1.2

Module:    ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
Kernel:    2.6.18-194.el5
Whitelist: /usr/src/kernels/2.6.18-194.el5-x86_64/kabi_whitelist

WARNING: The following symbols are used by your module
WARNING: and are not on the ABI whitelist.

symbol: acpi_walk_namespace
symbol: agp_bridges
symbol: acpi_get_handle
symbol: acpi_os_wait_events_complete
symbol: acpi_evaluate_object
symbol: acpi_bus_get_device
symbol: acpi_install_notify_handler
symbol: acpi_evaluate_integer
symbol: acpi_remove_notify_handler


Are you using a different ABI Checker script?
Version 1.2 does not seem to be happy
about the nvidia kernel modules here.
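
The warning above boils down to a set difference: the symbols the module uses, minus the symbols on the whitelist. The sketch below illustrates that idea only. It is not the abi_check.py script, and it assumes a whitelist file with one symbol per line plus optional bracketed section headers, which may not match the real file exactly. The function names and the sample data are hypothetical.

```python
def load_whitelist(text):
    """Collect symbol names, skipping blank lines and [section] headers."""
    symbols = set()
    for line in text.splitlines():
        name = line.strip()
        if name and not name.startswith("["):
            symbols.add(name)
    return symbols


def off_whitelist(module_symbols, whitelist):
    """Return the module's symbols that are not whitelisted, in input order."""
    return [sym for sym in module_symbols if sym not in whitelist]


if __name__ == "__main__":
    # Hypothetical whitelist fragment, for illustration only.
    sample = "[example_whitelist]\n\t__kmalloc\n\tprintk\n"
    used = ["printk", "acpi_get_handle", "agp_bridges"]
    for sym in off_whitelist(used, load_whitelist(sample)):
        print("symbol:", sym)
```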

Best Regards

Jarek

__
---
_ Jaroslaw_Polok __ CERN - IT/OIS/ODS _
_ http://home.cern.ch/~jpolok ___ tel_+41_22_767_1834 _
_ +41_78_792_0795 _


Re: Nvidia woes...

2010-04-28 Thread Alan Bartlett
On 27 April 2010 21:54, Mark Stodola  wrote:
> Hey everyone,
>
> I currently have deployed a number of SL 5.2 i386 machines.  Due to the
> circumstances, I'm not in a position to upgrade them to the latest 5.x with
> ease.  Lately I've been having trouble with systems locking up hard that are
> running an nvidia card using the 190.42 or 195.36.15 proprietary drivers.
>  Dual monitors connected via DVI, twinview.
>
> I've tried a GeForce 9600GT as well as a Quadro NVS 290 with varying
> success.  The Quadro seems to have lasted about a month before locking,
> while the 9600GT is much more often, daily/weekly.  I'm running the stock
> 5.2 kernel (2.6.18-92.1.6.el5) and Xorg (xorg-x11-server
> 1.1.1-48.41.el5_2.1).  The systems are generally idle when it happens.  I'm
> having no luck capturing log data or kdump data.
>
> The strange part is, having identical hardware in several locations, only
> some experience the issue.
>
> Hardware:
> Intel DG43NB motherboards (bios revision doesn't seem to matter at this
> point, running 98,99,104, or 105)
> ^- hardware revision is the same for all of them: AAE34877-402
> Areca ARC-1200 SATA RAID card (latest firmware, 1.48), running 2 320G
> seagates
> Additional PCI-e NIC, Intel PRO/1000, running e1000e v0.4.1.12-NAPI
> Single stick, 1GB DDR2 (800) memory
> PS/2 Keyboard/mouse
>
> I'm curious if anyone else has run into similar problems such as this, and
> if they have found a solution.  I'm looking at trying the 185.18.31 drivers,
> which seem to be "certified" for linux by a few software vendors, according
> to nvidia's website.
>
> What driver versions and/or card make/models are people using successfully?
>  Any help or pointers are greatly appreciated.
>
> As I said, not all of them are misbehaving, and I have several with the same
> config minus the video card running fine on SL5.2 and Windows XP Pro (SP3).
>
> Getting desperate,
> Mark

Mark,

I don't use nVidia graphics cards and also should mention my
connection to the ELRepo Project "up front" but have you tried using
the kernel independent, kABI tracking kmod packages that the ELRepo
Project provides? [1]

There are three different packages available, kmod-nvidia [2],
kmod-nvidia-96xx [3] and kmod-nvidia-173xx [4].

If you would like to discuss the usage before trying any one of them,
there is an ELRepo users' mailing list [5] and, if there should be a
problem, the ELRepo bug tracker [6].

Regards,
Alan.

[1] http://elrepo.org
[2] http://elrepo.org/tiki/kmod-nvidia
[3] http://elrepo.org/tiki/kmod-nvidia-96xx
[4] http://elrepo.org/tiki/kmod-nvidia-173xx
[5] http://lists.elrepo.org/mailman/listinfo/elrepo
[6] http://elrepo.org/bugs/main_page.php


RE: Nvidia woes...

2010-04-27 Thread Laski, Michael
Mark,

I had a problem like that with a (now decommissioned) SL5.3 box with a GeForce
FX5000 series card.  I seem to recall that after installing the nVidia 190.53
drivers, the issue disappeared.  Under 190.42, the machine would randomly lock
up then reboot--it was really frustrating.  I never found the cause, since
updating the drivers made the issue go away.

Good luck!

-Mike


-----Original Message-----
From: owner-scientific-linux-us...@listserv.fnal.gov 
[mailto:owner-scientific-linux-us...@listserv.fnal.gov] On Behalf Of Mark 
Stodola
Sent: Tuesday, April 27, 2010 4:55 PM
To: SCIENTIFIC-LINUX-USERS@listserv.fnal.gov
Subject: Nvidia woes...

Hey everyone,

I currently have deployed a number of SL 5.2 i386 machines.  Due to the 
circumstances, I'm not in a position to upgrade them to the latest 5.x 
with ease.  Lately I've been having trouble with systems locking up hard 
that are running an nvidia card using the 190.42 or 195.36.15 
proprietary drivers.  Dual monitors connected via DVI, twinview.

I've tried a GeForce 9600GT as well as a Quadro NVS 290 with varying 
success.  The Quadro seems to have lasted about a month before locking, 
while the 9600GT is much more often, daily/weekly.  I'm running the 
stock 5.2 kernel (2.6.18-92.1.6.el5) and Xorg (xorg-x11-server 
1.1.1-48.41.el5_2.1).  The systems are generally idle when it happens.  
I'm having no luck capturing log data or kdump data.

The strange part is, having identical hardware in several locations, 
only some experience the issue.

Hardware:
Intel DG43NB motherboards (bios revision doesn't seem to matter at this 
point, running 98,99,104, or 105)
^- hardware revision is the same for all of them: AAE34877-402
Areca ARC-1200 SATA RAID card (latest firmware, 1.48), running 2 320G 
seagates
Additional PCI-e NIC, Intel PRO/1000, running e1000e v0.4.1.12-NAPI
Single stick, 1GB DDR2 (800) memory
PS/2 Keyboard/mouse

I'm curious if anyone else has run into similar problems such as this, 
and if they have found a solution.  I'm looking at trying the 185.18.31 
drivers, which seem to be "certified" for linux by a few software 
vendors, according to nvidia's website.

What driver versions and/or card make/models are people using 
successfully?  Any help or pointers are greatly appreciated.

As I said, not all of them are misbehaving, and I have several with the 
same config minus the video card running fine on SL5.2 and Windows XP 
Pro (SP3).

Getting desperate,
Mark

-- 
Mr. Mark V. Stodola
Digital Systems Engineer

National Electrostatics Corp.
P.O. Box 620310
Middleton, WI 53562-0310 USA
Phone: (608) 831-7600
Fax: (608) 831-9591


Re: Nvidia woes...

2010-04-27 Thread Sergio Ballestrero
Hi,
SL5.4 with nVidia 185.18.36 on NV286 & FX370.
See
http://www.mail-archive.com/scientific-linux-users@listserv.fnal.gov/msg05399.html
for the gory details...

 Cheers,
  Sergio

On 27 Apr 2010, at 23:17, Mark Stodola wrote:

> Sergio,
> 
> I haven't noticed any memory leaks, but I also haven't been actively hunting 
> them down.  There don't seem to be any signs of dwindling performance before 
> this happens.  Most times, it is just idling overnight.  At most, there is a 
> small amount of network traffic on an isolated LAN of no more than 5 or so 
> devices, mostly Win XP or SL5.2 systems (often running off a custom livecd 
> based on Urs' scripts).
> 
> What card/config/drivers are you running?
> 
> Cheers,
> Mark
> 
> Sergio Ballestrero wrote:
>> Hello Mark,
>> we are having problems with X11 slowly leaking memory, which then leads to a 
>> system crash. Do you see anything similar?
>> My attempts at using valgrind have been inconclusive (if not confusing) up 
>> to now...
>> 
>> Cheers,
>>  Sergio
>> 
>> On 27 Apr 2010, at 22:54, Mark Stodola wrote:
>> 
>>  
>>> Hey everyone,
>>> 
>>> I currently have deployed a number of SL 5.2 i386 machines.  Due to the 
>>> circumstances, I'm not in a position to upgrade them to the latest 5.x with 
>>> ease.  Lately I've been having trouble with systems locking up hard that 
>>> are running an nvidia card using the 190.42 or 195.36.15 proprietary 
>>> drivers.  Dual monitors connected via DVI, twinview.
>>> 
>>> I've tried a GeForce 9600GT as well as a Quadro NVS 290 with varying 
>>> success.  The Quadro seems to have lasted about a month before locking, 
>>> while the 9600GT is much more often, daily/weekly.  I'm running the stock 
>>> 5.2 kernel (2.6.18-92.1.6.el5) and Xorg (xorg-x11-server 
>>> 1.1.1-48.41.el5_2.1).  The systems are generally idle when it happens.  I'm 
>>> having no luck capturing log data or kdump data.
>>> 
>>> The strange part is, having identical hardware in several locations, only 
>>> some experience the issue.
>>> 
>>> Hardware:
>>> Intel DG43NB motherboards (bios revision doesn't seem to matter at this 
>>> point, running 98,99,104, or 105)
>>> ^- hardware revision is the same for all of them: AAE34877-402
>>> Areca ARC-1200 SATA RAID card (latest firmware, 1.48), running 2 320G 
>>> seagates
>>> Additional PCI-e NIC, Intel PRO/1000, running e1000e v0.4.1.12-NAPI
>>> Single stick, 1GB DDR2 (800) memory
>>> PS/2 Keyboard/mouse
>>> 
>>> I'm curious if anyone else has run into similar problems such as this, and 
>>> if they have found a solution.  I'm looking at trying the 185.18.31 
>>> drivers, which seem to be "certified" for linux by a few software vendors, 
>>> according to nvidia's website.
>>> 
>>> What driver versions and/or card make/models are people using successfully? 
>>>  Any help or pointers are greatly appreciated.
>>> 
>>> As I said, not all of them are misbehaving, and I have several with the 
>>> same config minus the video card running fine on SL5.2 and Windows XP Pro 
>>> (SP3).
>>> 
>>> Getting desperate,
>>> Mark
>>> 
>>> -- 
>>> Mr. Mark V. Stodola
>>> Digital Systems Engineer
>>> 
>>> National Electrostatics Corp.
>>> P.O. Box 620310
>>> Middleton, WI 53562-0310 USA
>>> Phone: (608) 831-7600
>>> Fax: (608) 831-9591
>>>
>> 
>>  
> 
> 
> -- 
> Mr. Mark V. Stodola
> Digital Systems Engineer
> 
> National Electrostatics Corp.
> P.O. Box 620310
> Middleton, WI 53562-0310 USA
> Phone: (608) 831-7600
> Fax: (608) 831-9591
> 

-- 
 Sergio Ballestrero  - http://physics.uj.ac.za/psiwiki/Ballestrero
 University of Johannesburg, Physics Department
 ATLAS TDAQ sysadmin group - Office:75240 OnCall:164851


Re: Nvidia woes...

2010-04-27 Thread Mark Stodola

Sergio,

I haven't noticed any memory leaks, but I also haven't been actively 
hunting them down.  There don't seem to be any signs of dwindling 
performance before this happens.  Most times, it is just idling 
overnight.  At most, there is a small amount of network traffic on an 
isolated LAN of no more than 5 or so devices, mostly Win XP or SL5.2 
systems (often running off a custom livecd based on Urs' scripts).


What card/config/drivers are you running?

Cheers,
Mark

Sergio Ballestrero wrote:
> Hello Mark,
> we are having problems with X11 slowly leaking memory, which then leads to a
> system crash. Do you see anything similar?
> My attempts at using valgrind have been inconclusive (if not confusing) up to
> now...
>
> Cheers,
>  Sergio
>
> On 27 Apr 2010, at 22:54, Mark Stodola wrote:
>
>> Hey everyone,
>>
>> I currently have deployed a number of SL 5.2 i386 machines.  Due to the
>> circumstances, I'm not in a position to upgrade them to the latest 5.x with
>> ease.  Lately I've been having trouble with systems locking up hard that are
>> running an nvidia card using the 190.42 or 195.36.15 proprietary drivers.  Dual
>> monitors connected via DVI, twinview.
>>
>> I've tried a GeForce 9600GT as well as a Quadro NVS 290 with varying success.
>> The Quadro seems to have lasted about a month before locking, while the 9600GT
>> is much more often, daily/weekly.  I'm running the stock 5.2 kernel
>> (2.6.18-92.1.6.el5) and Xorg (xorg-x11-server 1.1.1-48.41.el5_2.1).  The
>> systems are generally idle when it happens.  I'm having no luck capturing log
>> data or kdump data.
>>
>> The strange part is, having identical hardware in several locations, only some
>> experience the issue.
>>
>> Hardware:
>> Intel DG43NB motherboards (bios revision doesn't seem to matter at this point,
>> running 98,99,104, or 105)
>> ^- hardware revision is the same for all of them: AAE34877-402
>> Areca ARC-1200 SATA RAID card (latest firmware, 1.48), running 2 320G seagates
>> Additional PCI-e NIC, Intel PRO/1000, running e1000e v0.4.1.12-NAPI
>> Single stick, 1GB DDR2 (800) memory
>> PS/2 Keyboard/mouse
>>
>> I'm curious if anyone else has run into similar problems such as this, and if they have
>> found a solution.  I'm looking at trying the 185.18.31 drivers, which seem to be
>> "certified" for linux by a few software vendors, according to nvidia's website.
>>
>> What driver versions and/or card make/models are people using successfully?
>> Any help or pointers are greatly appreciated.
>>
>> As I said, not all of them are misbehaving, and I have several with the same
>> config minus the video card running fine on SL5.2 and Windows XP Pro (SP3).
>>
>> Getting desperate,
>> Mark
>>
>> --
>> Mr. Mark V. Stodola
>> Digital Systems Engineer
>>
>> National Electrostatics Corp.
>> P.O. Box 620310
>> Middleton, WI 53562-0310 USA
>> Phone: (608) 831-7600
>> Fax: (608) 831-9591



  



--
Mr. Mark V. Stodola
Digital Systems Engineer

National Electrostatics Corp.
P.O. Box 620310
Middleton, WI 53562-0310 USA
Phone: (608) 831-7600
Fax: (608) 831-9591


Re: Nvidia woes...

2010-04-27 Thread Sergio Ballestrero
Hello Mark,

We are having problems with X11 slowly leaking memory, which then leads to a
system crash. Do you see anything similar?
My attempts at using valgrind have been inconclusive (if not confusing) up to
now...

 Cheers,
  Sergio
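
One low-tech way to confirm a slow leak like this, short of valgrind, is to sample the X server's resident memory at intervals and see whether it only ever goes up. The sketch below assumes a Linux /proc filesystem with a `VmRSS:` line in `/proc/<pid>/status`; the function names and the "every sample larger than the last" rule are illustrative assumptions, not something anyone in this thread reported using.

```python
import re


def parse_vmrss_kb(status_text):
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kb(pid):
    """Read the current resident set size of a running process, in kB."""
    with open("/proc/%d/status" % pid) as handle:
        return parse_vmrss_kb(handle.read())


def steadily_growing(samples, min_growth_kb=0):
    """True if each sample exceeds the previous one by more than min_growth_kb."""
    return len(samples) >= 2 and all(
        later - earlier > min_growth_kb
        for earlier, later in zip(samples, samples[1:])
    )
```

Calling `read_rss_kb` with the X server's pid from a periodic job and feeding the collected values to `steadily_growing` gives a rough yes/no signal. A real leak hunt would still need longer runs and a look at what the growth correlates with.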

On 27 Apr 2010, at 22:54, Mark Stodola wrote:

> Hey everyone,
> 
> I currently have deployed a number of SL 5.2 i386 machines.  Due to the 
> circumstances, I'm not in a position to upgrade them to the latest 5.x with 
> ease.  Lately I've been having trouble with systems locking up hard that are 
> running an nvidia card using the 190.42 or 195.36.15 proprietary drivers.  
> Dual monitors connected via DVI, twinview.
> 
> I've tried a GeForce 9600GT as well as a Quadro NVS 290 with varying success. 
>  The Quadro seems to have lasted about a month before locking, while the 
> 9600GT is much more often, daily/weekly.  I'm running the stock 5.2 kernel 
> (2.6.18-92.1.6.el5) and Xorg (xorg-x11-server 1.1.1-48.41.el5_2.1).  The 
> systems are generally idle when it happens.  I'm having no luck capturing log 
> data or kdump data.
> 
> The strange part is, having identical hardware in several locations, only 
> some experience the issue.
> 
> Hardware:
> Intel DG43NB motherboards (bios revision doesn't seem to matter at this 
> point, running 98,99,104, or 105)
> ^- hardware revision is the same for all of them: AAE34877-402
> Areca ARC-1200 SATA RAID card (latest firmware, 1.48), running 2 320G seagates
> Additional PCI-e NIC, Intel PRO/1000, running e1000e v0.4.1.12-NAPI
> Single stick, 1GB DDR2 (800) memory
> PS/2 Keyboard/mouse
> 
> I'm curious if anyone else has run into similar problems such as this, and if 
> they have found a solution.  I'm looking at trying the 185.18.31 drivers, 
> which seem to be "certified" for linux by a few software vendors, according 
> to nvidia's website.
> 
> What driver versions and/or card make/models are people using successfully?  
> Any help or pointers are greatly appreciated.
> 
> As I said, not all of them are misbehaving, and I have several with the same 
> config minus the video card running fine on SL5.2 and Windows XP Pro (SP3).
> 
> Getting desperate,
> Mark
> 
> -- 
> Mr. Mark V. Stodola
> Digital Systems Engineer
> 
> National Electrostatics Corp.
> P.O. Box 620310
> Middleton, WI 53562-0310 USA
> Phone: (608) 831-7600
> Fax: (608) 831-9591

-- 
 Sergio Ballestrero  - http://physics.uj.ac.za/psiwiki/Ballestrero
 University of Johannesburg, Physics Department
 ATLAS TDAQ sysadmin group - Office:75240 OnCall:164851