Re: Nvidia woes...

2010-04-28 Thread Alan Bartlett
On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:
 Hey everyone,

 I currently have deployed a number of SL 5.2 i386 machines.  Due to the
 circumstances, I'm not in a position to upgrade them to the latest 5.x with
 ease.  Lately I've been having trouble with systems locking up hard that are
 running an nvidia card using the 190.42 or 195.36.15 proprietary drivers.
  Dual monitors connected via DVI, twinview.

 I've tried a GeForce 9600GT as well as a Quadro NVS 290 with varying
 success.  The Quadro seems to have lasted about a month before locking,
 while the 9600GT locks up much more often, daily/weekly.  I'm running the stock
 5.2 kernel (2.6.18-92.1.6.el5) and Xorg (xorg-x11-server
 1.1.1-48.41.el5_2.1).  The systems are generally idle when it happens.  I'm
 having no luck capturing log data or kdump data.

 The strange part is, having identical hardware in several locations, only
 some experience the issue.

 Hardware:
 Intel DG43NB motherboards (bios revision doesn't seem to matter at this
 point, running 98,99,104, or 105)
 ^- hardware revision is the same for all of them: AAE34877-402
 Areca ARC-1200 SATA RAID card (latest firmware, 1.48), running 2 320G
 seagates
 Additional PCI-e NIC, Intel PRO/1000, running e1000e v0.4.1.12-NAPI
 Single stick, 1GB DDR2 (800) memory
 PS/2 Keyboard/mouse

 I'm curious if anyone else has run into similar problems such as this, and
 if they have found a solution.  I'm looking at trying the 185.18.31 drivers,
 which seem to be certified for linux by a few software vendors, according
 to nvidia's website.

 What driver versions and/or card make/models are people using successfully?
  Any help or pointers are greatly appreciated.

 As I said, not all of them are misbehaving, and I have several with the same
 config minus the video card running fine on SL5.2 and Windows XP Pro (SP3).

 Getting desperate,
 Mark

Mark,

I don't use nVidia graphics cards and also should mention my
connection to the ELRepo Project up front but have you tried using
the kernel independent, kABI tracking kmod packages that the ELRepo
Project provides? [1]

There are three different packages available, kmod-nvidia [2],
kmod-nvidia-96xx [3] and kmod-nvidia-173xx [4].

If you would like to discuss the usage before trying any one of them,
there is an ELRepo users' mailing list [5] and, if there should be a
problem, the ELRepo bug tracker [6].

Regards,
Alan.

[1] http://elrepo.org
[2] http://elrepo.org/tiki/kmod-nvidia
[3] http://elrepo.org/tiki/kmod-nvidia-96xx
[4] http://elrepo.org/tiki/kmod-nvidia-173xx
[5] http://lists.elrepo.org/mailman/listinfo/elrepo
[6] http://elrepo.org/bugs/main_page.php


Re: Nvidia woes...

2010-04-28 Thread Jaroslaw Polok
Alan Bartlett wrote:

[...]
 I don't use nVidia graphics cards and also should mention my
 connection to the ELRepo Project up front but have you tried using
 the kernel independent, kABI tracking kmod packages that the ELRepo
 Project provides? [1]

We do use some NVIDIA's but mostly with 96xx legacy series
driver (however we may start using the 195 current series
on our future hardware):

So far we use nvidia packages we maintain ourselves but it would
be interesting for us to change this situation...

Speaking of which, I'm a little bit confused about nvidia kernel
modules and kABI: checking http://dup.et.redhat.com/
and using the kABI testing script gives the following result on your

kmod-nvidia-190.53-1.el5.elrepo.x86_64.rpm, nvidia.ko:

./abi_check.py ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
Red Hat Enterprise Linux 5 ABI Checker
--

ABI Checker version: 1.2

Module:    ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
Kernel:    2.6.18-194.el5
Whitelist: /usr/src/kernels/2.6.18-194.el5-x86_64/kabi_whitelist

WARNING: The following symbols are used by your module
WARNING: and are not on the ABI whitelist.

symbol: acpi_walk_namespace
symbol: agp_bridges
symbol: acpi_get_handle
symbol: acpi_os_wait_events_complete
symbol: acpi_evaluate_object
symbol: acpi_bus_get_device
symbol: acpi_install_notify_handler
symbol: acpi_evaluate_integer
symbol: acpi_remove_notify_handler


Are you using a different ABI Checker script?
Version 1.2 does not seem to be happy
about the nvidia kernel modules here.

Best Regards

Jarek

__
---
_ Jaroslaw_Polok __ CERN - IT/OIS/ODS _
_ http://home.cern.ch/~jpolok ___ tel_+41_22_767_1834 _
_ +41_78_792_0795 _
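
The check reported above boils down to comparing the symbols a module uses against the kernel's ABI whitelist. The following is only a minimal sketch of that idea, not the actual abi_check.py: the whitelist format (one symbol per line, with bracketed section headers) and the sample whitelist contents are assumptions made for illustration, and the function names are hypothetical.

```python
def parse_whitelist(text):
    """Return the set of symbol names found in whitelist text."""
    symbols = set()
    for line in text.splitlines():
        entry = line.strip()
        # Skip blanks, comments and bracketed section headers (assumed format).
        if not entry or entry.startswith("#") or entry.startswith("["):
            continue
        symbols.add(entry)
    return symbols


def non_whitelisted(module_symbols, whitelist):
    """Return the module's symbols that are absent from the whitelist, sorted."""
    return sorted(set(module_symbols) - whitelist)


# Hypothetical whitelist excerpt, for illustration only; the real file is far longer.
SAMPLE_WHITELIST = """\
[rhel5_x86_64_whitelist]
\tprintk
\tkfree
"""

# Two symbols taken from the report above, plus one assumed to be whitelisted.
MODULE_SYMBOLS = ["acpi_walk_namespace", "agp_bridges", "printk"]

if __name__ == "__main__":
    allowed = parse_whitelist(SAMPLE_WHITELIST)
    for name in non_whitelisted(MODULE_SYMBOLS, allowed):
        print("symbol:", name)
```

In practice the module's symbol list would come from inspecting nvidia.ko itself and the whitelist from the kernel-devel tree named in the report; both are replaced by inline samples here so that the sketch stays self-contained.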


Re: Nvidia woes...

2010-04-28 Thread Alan Bartlett
On 28 April 2010 14:10, Alan Bartlett a...@elrepo.org wrote:
 On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:
 Hey everyone,

 I currently have deployed a number of SL 5.2 i386 machines.

Mark,

Further to my earlier message, I have re-read your initial sentence above.

An SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5
series. Unfortunately the kmod-nvidia[-*] packages that are available
from ELRepo, although being kABI tracking, will only weak-link back to
the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many
of the other packages, back to the original 2.6.18-8.x.y.el5 kernels.

So having raised your hopes, I now have to dash them. Sorry.

Regards,
Alan.
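
To make the constraint concrete, here is a rough sketch of checking whether a given EL5 kernel release is new enough for the kmod-nvidia packages to weak-link against. It assumes only what the message states (2.6.18-128 is the oldest series supported) and that release strings look like those quoted in this thread; the function names and the threshold constant are hypothetical, and the code assumes a current Python 3 rather than the interpreter shipped with SL 5.

```python
import re

# Oldest EL5 kernel build the kmod-nvidia packages weak-link back to,
# per the message above (2.6.18-128.x.y.el5, i.e. SL 5.3).
MIN_BUILD = 128


def el5_build(release):
    """Extract the build number from a release such as '2.6.18-92.1.6.el5'."""
    match = re.match(r"2\.6\.18-(\d+)\.", release)
    if match is None:
        raise ValueError("not an EL5 kernel release: %r" % release)
    return int(match.group(1))


def can_weak_link(release, min_build=MIN_BUILD):
    """Return True if the kernel build is at least min_build."""
    return el5_build(release) >= min_build


if __name__ == "__main__":
    for release in ("2.6.18-92.1.6.el5", "2.6.18-164.el5"):
        print(release, can_weak_link(release))
```

On a live system the release string would come from `uname -r` (or `platform.release()` in Python) instead of the hard-coded samples.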



Re: Nvidia woes...

2010-04-28 Thread Steven Timm

On Wed, 28 Apr 2010, Alan Bartlett wrote:


 On 28 April 2010 14:10, Alan Bartlett a...@elrepo.org wrote:
  On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:
  Hey everyone,

  I currently have deployed a number of SL 5.2 i386 machines.

 Mark,

 Further to my earlier message, I have re-read your initial sentence above.

 An SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5
 series.

It is possible to update any SL5.x system to the latest kernel
that is available, so you should be able to get around that.
For example, I run kernels from the 5.4 series on 5.3 systems all the
time, and did the same with 5.3 kernels on 5.2.

Steve

 Unfortunately the kmod-nvidia[-*] packages that are available
 from ELRepo, although being kABI tracking, will only weak-link back to
 the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many
 of the other packages, back to the original 2.6.18-8.x.y.el5 kernels.

 So having raised your hopes, I now have to dash them. Sorry.

 Regards,
 Alan.



--
--
Steven C. Timm, Ph.D  (630) 840-8525
t...@fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.


Re: Nvidia woes...

2010-04-28 Thread Alan Bartlett
On 28 April 2010 15:41, Jaroslaw Polok jaroslaw.po...@cern.ch wrote:

 We do use some NVIDIA's but mostly with 96xx legacy series
 driver (however we may start using the 195 current series
 on our future hardware):

 So far we use nvidia packages we maintain ourselves but it would
 be interesting for us to change this situation...

Hello Jarek,

As long as you are using SL 5.3 or above, you should find that the
packages will fulfil your requirements. The current (as distinct from
the legacy) package is regularly rebuilt when nVidia releases a new
version.

 Speaking of which, I'm a little bit confused about nvidia kernel
 modules and kABI: checking http://dup.et.redhat.com/
 and using the kABI testing script gives the following result on your

 kmod-nvidia-190.53-1.el5.elrepo.x86_64.rpm, nvidia.ko:

 ./abi_check.py ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
 Red Hat Enterprise Linux 5 ABI Checker
 --

 ABI Checker version: 1.2

 Module:    ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
 Kernel:    2.6.18-194.el5
 Whitelist: /usr/src/kernels/2.6.18-194.el5-x86_64/kabi_whitelist

 WARNING: The following symbols are used by your module
 WARNING: and are not on the ABI whitelist.

 symbol: acpi_walk_namespace
 symbol: agp_bridges
 symbol: acpi_get_handle
 symbol: acpi_os_wait_events_complete
 symbol: acpi_evaluate_object
 symbol: acpi_bus_get_device
 symbol: acpi_install_notify_handler
 symbol: acpi_evaluate_integer
 symbol: acpi_remove_notify_handler

That is a known issue and as I am not one of the team maintaining the
nVidia packages, it would be best if I do not go into great details
but just refer you to the relevant bug tracker entries [1][2]. The
whole issue of the kernel ABI whitelist, and of certain kmod packages
requiring symbols that are not listed, is something that we have
discussed with Jon Masters, of Red Hat.

Although I know that other members of the ELRepo Admin Team are
subscribed to this list, I think it might be best to transfer this
discussion to the ELRepo mailing list [3] where the more appropriate
audience can be found.

Regards,
Alan.

[1] http://elrepo.org/bugs/view.php?id=30
[2] https://bugzilla.redhat.com/show_bug.cgi?id=520891
[3] http://lists.elrepo.org/mailman/listinfo/elrepo


Re: Nvidia woes...

2010-04-28 Thread Phil Perry

Alan Bartlett wrote:

On 28 April 2010 14:10, Alan Bartlett a...@elrepo.org wrote:

On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:

Hey everyone,

I currently have deployed a number of SL 5.2 i386 machines.


Mark,

Further to my earlier message, I have re-read your initial sentence above.

An SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5
series. Unfortunately the kmod-nvidia[-*] packages that are available
from ELRepo, although being kABI tracking, will only weak-link back to
the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many
of the other packages, back to the original 2.6.18-8.x.y.el5 kernels.

So having raised your hopes, I now have to dash them. Sorry.

Regards,
Alan.



Correct for the current driver (kmod-nvidia) and 173 series 
(kmod-nvidia-173xx) legacy driver, but the older kmod-nvidia-96xx legacy 
driver is currently kABI compliant with all current EL5 kernels :)


Re: Nvidia woes...

2010-04-27 Thread Mark Stodola

Sergio,

I haven't noticed any memory leaks, but I also haven't been actively 
hunting them down.  There don't seem to be any signs of dwindling 
performance before this happens.  Most times, it is just idling 
overnight.  At most, there is a small amount of network traffic on an 
isolated LAN of no more than 5 or so devices, mostly Win XP or SL5.2 
systems (often running off a custom livecd based on Urs' scripts).


What card/config/drivers are you running?

Cheers,
Mark

Sergio Ballestrero wrote:

 Hello Mark,
we are having problems with X11 slowly leaking memory, which then leads to a 
system crash. Do you see anything similar?
 My attempts at using valgrind have been inconclusive (if not confusing) up to 
now...

 Cheers,
  Sergio


--
Mr. Mark V. Stodola
Digital Systems Engineer

National Electrostatics Corp.
P.O. Box 620310
Middleton, WI 53562-0310 USA
Phone: (608) 831-7600
Fax: (608) 831-9591
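
For anyone wanting to rule a slow X11 memory leak in or out, one low-effort approach is to sample the X server's resident memory periodically and look at the trend. The sketch below is only an illustration of that idea, not something used by anyone in this thread: it parses the VmRSS line from the text of /proc/<pid>/status, the helper names are hypothetical, the sample text is made up, and it assumes a current Python 3.

```python
import re


def rss_kib(status_text):
    """Return the VmRSS value (in kB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss(pid):
    """Read the current resident set size of a process from /proc (Linux only)."""
    with open("/proc/%d/status" % pid) as handle:
        return rss_kib(handle.read())


def growth_kib(samples):
    """Return last sample minus first sample: a crude indicator of growth."""
    return samples[-1] - samples[0] if samples else 0


# Made-up excerpt in the layout used by /proc/<pid>/status.
SAMPLE_STATUS = "Name:\tXorg\nVmSize:\t  120000 kB\nVmRSS:\t   51234 kB\n"

if __name__ == "__main__":
    print(rss_kib(SAMPLE_STATUS))
    print(growth_kib([51234, 51300, 52010]))
```

Calling read_rss() on the Xorg process from a cron job and appending the result to a file on disk would leave a record that survives a hard lockup, which may help when logs and kdump capture nothing.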


Re: Nvidia woes...

2010-04-27 Thread Sergio Ballestrero
 Hi,
SL5.4 with nVidia 185.18.36 on NV286 FX370.
See 
http://www.mail-archive.com/scientific-linux-users@listserv.fnal.gov/msg05399.html
 for the gory details...

 Cheers,
  Sergio


-- 
 Sergio Ballestrero  - http://physics.uj.ac.za/psiwiki/Ballestrero
 University of Johannesburg, Physics Department
 ATLAS TDAQ sysadmin group - Office:75240 OnCall:164851


RE: Nvidia woes...

2010-04-27 Thread Laski, Michael
Mark,

I had a problem like that with a (now decommissioned) SL5.3 box with a GeForce 
FX5000 series card.  I seem to recall that after installing the nVidia 190.53 
drivers, the issue disappeared.  Under 190.42, the machine would randomly lock 
up then reboot--it was really frustrating.  I never really found a cause since  
updating the drivers made the issue go away.

Good luck!

-Mike

