Re: Nvidia woes...

2010-04-28 Thread Alan Bartlett
On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:
 Hey everyone,

 I currently have deployed a number of SL 5.2 i386 machines.  Due to the
 circumstances, I'm not in a position to upgrade them to the latest 5.x with
 ease.  Lately I've been having trouble with systems locking up hard that are
 running an nvidia card using the 190.42 or 195.36.15 proprietary drivers.
  Dual monitors connected via DVI, twinview.

 I've tried a GeForce 9600GT as well as a Quadro NVS 290 with varying
 success.  The Quadro seems to have lasted about a month before locking up,
 while the 9600GT locks up much more often, daily to weekly.  I'm running the stock
 5.2 kernel (2.6.18-92.1.6.el5) and Xorg (xorg-x11-server
 1.1.1-48.41.el5_2.1).  The systems are generally idle when it happens.  I'm
 having no luck capturing log data or kdump data.

 The strange part is, having identical hardware in several locations, only
 some experience the issue.

 Hardware:
 Intel DG43NB motherboards (BIOS revision doesn't seem to matter at this
 point, running 98, 99, 104, or 105)
 ^- hardware revision is the same for all of them: AAE34877-402
 Areca ARC-1200 SATA RAID card (latest firmware, 1.48), running two 320 GB
 Seagate drives
 Additional PCI-e NIC, Intel PRO/1000, running e1000e v0.4.1.12-NAPI
 Single stick, 1GB DDR2 (800) memory
 PS/2 Keyboard/mouse

 I'm curious if anyone else has run into similar problems such as this, and
 if they have found a solution.  I'm looking at trying the 185.18.31 drivers,
 which seem to be certified for linux by a few software vendors, according
 to nvidia's website.

 What driver versions and/or card make/models are people using successfully?
  Any help or pointers are greatly appreciated.

 As I said, not all of them are misbehaving, and I have several with the same
 config minus the video card running fine on SL5.2 and Windows XP Pro (SP3).

 Getting desperate,
 Mark

Mark,

I don't use nVidia graphics cards, and I should mention my
connection to the ELRepo Project up front, but have you tried using
the kernel-independent, kABI-tracking kmod packages that the ELRepo
Project provides? [1]

There are three different packages available, kmod-nvidia [2],
kmod-nvidia-96xx [3] and kmod-nvidia-173xx [4].

If you would like to discuss the usage before trying any one of them,
there is an ELRepo users' mailing list [5] and, if there should be a
problem, the ELRepo bug tracker [6].

Regards,
Alan.

[1] http://elrepo.org
[2] http://elrepo.org/tiki/kmod-nvidia
[3] http://elrepo.org/tiki/kmod-nvidia-96xx
[4] http://elrepo.org/tiki/kmod-nvidia-173xx
[5] http://lists.elrepo.org/mailman/listinfo/elrepo
[6] http://elrepo.org/bugs/main_page.php


Re: Nvidia woes...

2010-04-28 Thread Jaroslaw Polok
Alan Bartlett wrote:

[...]
 I don't use nVidia graphics cards, and I should mention my
 connection to the ELRepo Project up front, but have you tried using
 the kernel-independent, kABI-tracking kmod packages that the ELRepo
 Project provides? [1]

We do use some NVIDIA cards, but mostly with the 96xx legacy series
driver (however, we may start using the current 195 series
on our future hardware).

So far we use nvidia packages we maintain ourselves but it would
be interesting for us to change this situation...

Speaking of which, I'm a little bit confused about nvidia kernel
modules and kABI: checking http://dup.et.redhat.com/
and using the kABI testing script gives the following result on your
kmod-nvidia-190.53-1.el5.elrepo.x86_64.rpm, nvidia.ko:

./abi_check.py ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
Red Hat Enterprise Linux 5 ABI Checker
--

ABI Checker version: 1.2

Module:    ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
Kernel:    2.6.18-194.el5
Whitelist: /usr/src/kernels/2.6.18-194.el5-x86_64/kabi_whitelist

WARNING: The following symbols are used by your module
WARNING: and are not on the ABI whitelist.

symbol: acpi_walk_namespace
symbol: agp_bridges
symbol: acpi_get_handle
symbol: acpi_os_wait_events_complete
symbol: acpi_evaluate_object
symbol: acpi_bus_get_device
symbol: acpi_install_notify_handler
symbol: acpi_evaluate_integer
symbol: acpi_remove_notify_handler


Are you using a different ABI Checker script? Version 1.2 does not
seem to be happy about the nvidia kernel modules here.

Best Regards

Jarek

--
Jaroslaw Polok, CERN - IT/OIS/ODS
http://home.cern.ch/~jpolok
tel +41 22 767 1834
+41 78 792 0795


Re: [OT] Re: xorg-x11-fonts-ISO8859-1-75dpi breaks when using rpm

2010-04-28 Thread Garrett Holmstrom

On 4/27/2010 9:36, Tim Edwards wrote:

On 27/04/10 16:16, Faye Gibbins wrote:

Yes but we use the mdp devolved layer and I've asked if their repos are
yum enabled and they say no.

So unless my LM says I can create a yum archive I'm not sure what else
I can do.


Can you get them to agree to at least temporarily let you use yum
against the official Scientificlinux repos on the web?

If not you're out of luck unfortunately. We used to have no access to
yum repositories from our DMZ machines and it was very painful getting
something installed with just rpm.

One tip though, assuming you have a machine with a working yum (your
desktop maybe?), is to do a 'yum whatprovides mkfontdir', where
mkfontdir is what it complains is missing. That way you can see exactly
which RPM is needed.


Is the problem that you can't use yum at all or that you just can't use 
yum for a specific set of rpms?  If it's the second then you could 
always try ``yum localinstall'' so yum can sort out the dependencies 
that *are* in SL's repos.


Re: Nvidia woes...

2010-04-28 Thread Alan Bartlett
On 28 April 2010 14:10, Alan Bartlett a...@elrepo.org wrote:
 On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:
 Hey everyone,

 I currently have deployed a number of SL 5.2 i386 machines.

Mark,

Further to my earlier message, I have re-read your initial sentence above.

An SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5
series. Unfortunately the kmod-nvidia[-*] packages that are available
from ELRepo, although being kABI tracking, will only weak-link back to
the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many
of the other packages, back to the original 2.6.18-8.x.y.el5 kernels.

So having raised your hopes, I now have to dash them. Sorry.
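
The series cut-off described above can be sketched as a simple comparison on the kernel release string. This is only an illustration in Python: the function names and the `minimum_build` parameter are made up for the example, the 128 threshold is the figure quoted in this message for the kmod-nvidia packages, and the real weak-linking mechanism does not simply compare release numbers.

```python
def el5_build_number(kernel_release):
    """Return the build number after '2.6.18-', e.g. 92 for '2.6.18-92.1.6.el5'."""
    _base, sep, rest = kernel_release.partition("-")
    if not sep:
        raise ValueError("no '-' in kernel release: %r" % kernel_release)
    return int(rest.split(".")[0])


def in_supported_series(kernel_release, minimum_build=128):
    """True if the release belongs to the minimum_build series or a later one."""
    return el5_build_number(kernel_release) >= minimum_build


# Kernel releases mentioned in this thread.
for release in ("2.6.18-92.1.6.el5", "2.6.18-164.el5", "2.6.18-194.el5"):
    print(release, in_supported_series(release))
```

On a running system the string to test would typically come from `uname -r`.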

Regards,
Alan.



Re: Nvidia woes...

2010-04-28 Thread Steven Timm

On Wed, 28 Apr 2010, Alan Bartlett wrote:


On 28 April 2010 14:10, Alan Bartlett a...@elrepo.org wrote:

On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:

Hey everyone,

I currently have deployed a number of SL 5.2 i386 machines.


Mark,

Further to my earlier message, I have re-read your initial sentence above.

An SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5
series.


It is possible to update any SL5.x system to the latest kernel
that is available, so you should be able to get around that.
For example, I run kernels from the 5.4 series on 5.3 systems all the
time, and I ran 5.3 kernels on 5.2 as well.
Steve



Unfortunately the kmod-nvidia[-*] packages that are available
from ELRepo, although being kABI tracking, will only weak-link back to
the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many
of the other packages, back to the original 2.6.18-8.x.y.el5 kernels.

So having raised your hopes, I now have to dash them. Sorry.

Regards,
Alan.



--
Steven C. Timm, Ph.D  (630) 840-8525
t...@fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.


Re: Nvidia woes...

2010-04-28 Thread Alan Bartlett
On 28 April 2010 15:41, Jaroslaw Polok jaroslaw.po...@cern.ch wrote:

 We do use some NVIDIA cards, but mostly with the 96xx legacy series
 driver (however, we may start using the current 195 series
 on our future hardware).

 So far we use nvidia packages we maintain ourselves but it would
 be interesting for us to change this situation...

Hello Jarek,

As long as you are using SL 5.3 or above, you should find that the
packages will fulfil your requirements. The current (as distinct from
the legacy) package is regularly rebuilt when nVidia releases a new
version.

 Speaking of which, I'm a little bit confused about nvidia kernel
 modules and kABI: checking http://dup.et.redhat.com/
 and using the kABI testing script gives the following result on your
 kmod-nvidia-190.53-1.el5.elrepo.x86_64.rpm, nvidia.ko:

 ./abi_check.py ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
 Red Hat Enterprise Linux 5 ABI Checker
 --

 ABI Checker version: 1.2

 Module:    ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
 Kernel:    2.6.18-194.el5
 Whitelist: /usr/src/kernels/2.6.18-194.el5-x86_64/kabi_whitelist

 WARNING: The following symbols are used by your module
 WARNING: and are not on the ABI whitelist.

 symbol: acpi_walk_namespace
 symbol: agp_bridges
 symbol: acpi_get_handle
 symbol: acpi_os_wait_events_complete
 symbol: acpi_evaluate_object
 symbol: acpi_bus_get_device
 symbol: acpi_install_notify_handler
 symbol: acpi_evaluate_integer
 symbol: acpi_remove_notify_handler

That is a known issue and, as I am not one of the team maintaining the
nVidia packages, it would be best if I do not go into great detail
but just refer you to the relevant bug tracker entries [1][2]. The
whole issue of the kernel ABI whitelist, and of certain kmod packages
requiring symbols that are not on it, is something that
we have discussed with Jon Masters, of Red Hat.
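
For readers unfamiliar with the check, what the quoted report boils down to is a set difference: symbols the module uses, minus symbols on the whitelist. The following Python sketch is not the `abi_check.py` script itself. The function names are invented for illustration, the whitelist layout (one symbol per line, optional `[section]` headers) is an assumption, and the whitelist entries are placeholders rather than real EL5 entries.

```python
def parse_whitelist(text):
    """Collect symbol names, skipping blank lines and [section] headers (assumed layout)."""
    symbols = set()
    for line in text.splitlines():
        name = line.strip()
        if name and not name.startswith("["):
            symbols.add(name)
    return symbols


def non_whitelisted(module_symbols, whitelist):
    """Return the symbols a module uses that are absent from the whitelist."""
    return sorted(set(module_symbols) - whitelist)


# Hypothetical whitelist excerpt; these entries are placeholders, not the real EL5 list.
sample_whitelist = parse_whitelist(
    "[example_whitelist]\n\texample_symbol_a\n\texample_symbol_b\n"
)

# Two of the symbols from the report quoted above, plus one placeholder.
used = ["acpi_get_handle", "agp_bridges", "example_symbol_a"]
print(non_whitelisted(used, sample_whitelist))
# prints ['acpi_get_handle', 'agp_bridges']
```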

Although I know that other members of the ELRepo Admin Team are
subscribed to this list, I think it might be best to transfer this
discussion to the ELRepo mailing list [3] where the more appropriate
audience can be found.

Regards,
Alan.

[1] http://elrepo.org/bugs/view.php?id=30
[2] https://bugzilla.redhat.com/show_bug.cgi?id=520891
[3] http://lists.elrepo.org/mailman/listinfo/elrepo


Re: Memory footprint on 64bit SL vs. 32bit

2010-04-28 Thread Stephan Wiesand

On Apr 27, 2010, at 00:15 , Brett Viren wrote:

 We recently started running our C++ analysis code on 64bit SL5.3 and
 have been surprised to find the memory usage is about 2x what we are
 used to when running it on 32 bits.  A few basic applications
 like sleep(1) show a similar increase.  Others, like sshd, show only a
 30% size increase (maybe that is subject to configuration differences
 between the two hosts).
 
 I understand that pointers must double in size but the bulk of our
 objects are made of ints and floats and these are 32/64 bit-invariant.
 I found[1] that poorly defined structs containing pointers can bloat
 even on non-pointer data members due to the padding needed to keep
 everything properly aligned.  It would kind of surprise me if this is
 what is behind what we see.
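
The padding effect mentioned in the paragraph above can be observed without compiling anything, using Python's ctypes, which lays out structures with the platform's native alignment. This is only a sketch: the two struct names and their fields are invented for the example and have nothing to do with the analysis code under discussion.

```python
import ctypes


class Interleaved(ctypes.Structure):
    # 32-bit ints alternate with pointers, so on a 64-bit build each int
    # is followed by padding to realign the next pointer.
    _fields_ = [
        ("a", ctypes.c_int32),
        ("p", ctypes.c_void_p),
        ("b", ctypes.c_int32),
        ("q", ctypes.c_void_p),
    ]


class Grouped(ctypes.Structure):
    # Same members, pointers first: the two ints pack together at the end.
    _fields_ = [
        ("p", ctypes.c_void_p),
        ("q", ctypes.c_void_p),
        ("a", ctypes.c_int32),
        ("b", ctypes.c_int32),
    ]


pointer_size = ctypes.sizeof(ctypes.c_void_p)
print("pointer size:", pointer_size)
print("Interleaved: ", ctypes.sizeof(Interleaved))
print("Grouped:     ", ctypes.sizeof(Grouped))
```

With natural alignment on x86, a 64-bit build is expected to report 32 bytes for the interleaved layout and 24 for the grouped one, while a 32-bit build reports 16 for both.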
 
 Does anyone have experience in understanding or maybe even combating
 this increase in a program's memory footprint when going to 64 bits?

Is it real or virtual memory usage that's increasing beyond expectations?

Example: glibc's locale handling code will behave quite differently in the 
64-bit case. In 32-bit mode, even virtual address space is a scarce resource, 
while in 64-bit mode it isn't. So in the latter case, they simply mmap the 
whole file providing the info for the locale in use, while in the former they 
use a small address window they slide to the appropriate position. The 64-bit 
case is simpler and thus probably less code, more robust and easier to 
maintain. And it's probably faster. The 32-bit case uses less *virtual* memory 
- but *real* memory usage is about the same, since only those pages actually 
read will ever be paged in. This has a dramatic effect on the VSZ of hello
world in python. It has none on anything that really matters - in particular,
checking the memory footprints of sleep & co. is not very useful because
they're really small compared to typical HEP analysis apps anyway.
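
To compare virtual and resident size for a given process, both values can be read from /proc/PID/status, where they appear as the VmSize and VmRSS fields. A minimal Python sketch follows. The helper name `vm_fields` is made up for the example, and the sample text is a fabricated excerpt used only to show the parsing, not a measurement.

```python
def vm_fields(status_text, names=("VmSize", "VmRSS")):
    """Return {field: value in kB} for the requested fields of a /proc/<pid>/status dump."""
    values = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in names:
            values[key] = int(rest.split()[0])
    return values


# Made-up excerpt for illustration; the numbers are not measurements from this thread.
sample = "Name:\texample\nVmSize:\t 1048576 kB\nVmRSS:\t  204800 kB\n"
print(vm_fields(sample))

# On Linux, the current process can be inspected the same way:
# with open("/proc/self/status") as f:
#     print(vm_fields(f.read()))
```

A large gap between the two numbers points at mapped-but-untouched address space rather than memory actually in use.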

What are your actual figures?

 Thanks,
 -Brett.
 
 [1] http://www.codeproject.com/KB/winsdk/Optimization_64_bit.aspx#IDAJLKNC

-- 
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany





Problem with latest pam_krb5

2010-04-28 Thread Steve Gaarder
A number of my computers upgraded themselves to pam_krb5-2.2.14-15, and remote 
logins promptly broke.  If you tried to ssh in, it would ask for your password 
and then close the connection.  The log file shows an enigmatic message (the
'x' is the username):


account checks fail for 'x': unknown reason -1765328254 (Cannot read password)
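
The large negative number in that log line is less enigmatic once it is split the way com_err codes are built: the low 8 bits are an offset into an error table, and the remaining bits encode the table's short name in 6-bit characters. The Python sketch below assumes the value is a standard com_err code. The function name is invented for the example, and the decoding only identifies where the code comes from, not why the login fails.

```python
# Character set used by com_err to encode table names (1-based, 0 means "no char").
CHARSET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_"


def split_com_err(code):
    """Split a com_err code into (table name, offset within the table)."""
    unsigned = code & 0xFFFFFFFF          # view the 32-bit value as unsigned
    offset = unsigned & 0xFF              # low 8 bits: index into the table
    table = (unsigned >> 8) & 0xFFFFFF    # next 24 bits: up to four 6-bit chars
    name = ""
    for shift in (18, 12, 6, 0):
        ch = (table >> shift) & 0x3F
        if ch:
            name += CHARSET[ch - 1]
    return name, offset


print(split_com_err(-1765328254))
# prints ('krb5', 130)
```

So the code is entry 130 of the krb5 error table, which matches the "Cannot read password" text that the log already prints next to it.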


If I put back pam_krb5-2.2.14-10, it works fine.

To make it even stranger, only some of the systems that did this update have 
this problem.


Any ideas?

Steve Gaarder
System Administrator, Dept of Mathematics
Cornell University, Ithaca, NY, USA
gaar...@math.cornell.edu


Re: Memory footprint on 64bit SL vs. 32bit

2010-04-28 Thread Brett Viren
Thanks Stephan and Peter,

Peter Elmer peter.el...@cern.ch writes:

 We are actually preparing some proposals/recommendations about
 measuring memory use, as in addition to this VSIZE/64bit confusion the
 introduction of multicore applications which share memory also
 misleads people...

This is interesting.  I didn't know about the nuances you two bring up.
Peter, can you send a link whenever your document is available?

Stephan, we have been looking at /proc/PID/status's VmSize and VIRT from
top which I think are the same.  

For our Gaudi/Geant4/ROOT/Python based job on 64bits we see a size of
about 1GB after initial loading including Geant4 data sets and the
geometry.  This then plateaus to an eventual 1.5GB as we encounter rarer
and rarer upward fluctuations in event size (our Boost pools based
memory manager only grows as needed, never shrinks).  On 32 bits I'm
used to seeing about 50% of these numbers.

I'll look into the suggestions you both gave.

Thanks,
-Brett.





Re: Nvidia woes...

2010-04-28 Thread Phil Perry

Alan Bartlett wrote:

On 28 April 2010 14:10, Alan Bartlett a...@elrepo.org wrote:

On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:

Hey everyone,

I currently have deployed a number of SL 5.2 i386 machines.


Mark,

Further to my earlier message, I have re-read your initial sentence above.

An SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5
series. Unfortunately the kmod-nvidia[-*] packages that are available
from ELRepo, although being kABI tracking, will only weak-link back to
the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many
of the other packages, back to the original 2.6.18-8.x.y.el5 kernels.

So having raised your hopes, I now have to dash them. Sorry.

Regards,
Alan.



Correct for the current driver (kmod-nvidia) and the 173 series
(kmod-nvidia-173xx) legacy driver, but the older kmod-nvidia-96xx legacy
driver is currently kABI compliant with all current EL5 kernels :)