Re: Nvidia woes...
On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:
Hey everyone,

I currently have deployed a number of SL 5.2 i386 machines. Due to the circumstances, I'm not in a position to upgrade them to the latest 5.x with ease. Lately I've been having trouble with systems that run an nvidia card with the 190.42 or 195.36.15 proprietary drivers locking up hard. They have dual monitors connected via DVI, using TwinView. I've tried a GeForce 9600GT as well as a Quadro NVS 290 with varying success. The Quadro seems to have lasted about a month before locking up, while the 9600GT locks up much more often, daily or weekly.

I'm running the stock 5.2 kernel (2.6.18-92.1.6.el5) and Xorg (xorg-x11-server 1.1.1-48.41.el5_2.1). The systems are generally idle when it happens. I'm having no luck capturing log data or kdump data. The strange part is that, with identical hardware in several locations, only some experience the issue.

Hardware:
- Intel DG43NB motherboards (the BIOS revision doesn't seem to matter at this point: running 98, 99, 104, or 105; the hardware revision is the same for all of them: AAE34877-402)
- Areca ARC-1200 SATA RAID card (latest firmware, 1.48), running 2 320G Seagates
- Additional PCI-e NIC, Intel PRO/1000, running e1000e v0.4.1.12-NAPI
- Single stick, 1GB DDR2 (800) memory
- PS/2 keyboard/mouse

I'm curious if anyone else has run into similar problems, and if they have found a solution. I'm looking at trying the 185.18.31 drivers, which seem to be certified for Linux by a few software vendors, according to nvidia's website. What driver versions and/or card makes/models are people using successfully? Any help or pointers are greatly appreciated. As I said, not all of them are misbehaving, and I have several with the same config minus the video card running fine on SL 5.2 and Windows XP Pro (SP3).
Getting desperate,
Mark

Mark,

I don't use nVidia graphics cards, and I should mention my connection to the ELRepo Project up front, but have you tried using the kernel-independent, kABI-tracking kmod packages that the ELRepo Project provides? [1] There are three different packages available: kmod-nvidia [2], kmod-nvidia-96xx [3] and kmod-nvidia-173xx [4]. If you would like to discuss the usage before trying any one of them, there is an ELRepo users' mailing list [5] and, if there should be a problem, the ELRepo bug tracker [6].

Regards,
Alan.

[1] http://elrepo.org
[2] http://elrepo.org/tiki/kmod-nvidia
[3] http://elrepo.org/tiki/kmod-nvidia-96xx
[4] http://elrepo.org/tiki/kmod-nvidia-173xx
[5] http://lists.elrepo.org/mailman/listinfo/elrepo
[6] http://elrepo.org/bugs/main_page.php
Re: Nvidia woes...
Alan Bartlett wrote: [...] I don't use nVidia graphics cards and also should mention my connection to the ELRepo Project up front but have you tried using the kernel independent, kABI tracking kmod packages that the ELRepo Project provides? [1]

We do use some NVIDIA cards, but mostly with the 96xx legacy series driver (however, we may start using the 195 current series on our future hardware). So far we use nvidia packages we maintain ourselves, but it would be interesting for us to change this situation...

Speaking of which, I'm a little bit confused about nvidia kernel modules and kABI: checking http://dup.et.redhat.com/ and using the kABI testing script gives the following result on your kmod-nvidia-190.53-1.el5.elrepo.x86_64.rpm, nvidia.ko:

./abi_check.py ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
Red Hat Enterprise Linux 5 ABI Checker -- ABI Checker version: 1.2
Module: ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
Kernel: 2.6.18-194.el5
Whitelist: /usr/src/kernels/2.6.18-194.el5-x86_64/kabi_whitelist
WARNING: The following symbols are used by your module
WARNING: and are not on the ABI whitelist.
symbol: acpi_walk_namespace
symbol: agp_bridges
symbol: acpi_get_handle
symbol: acpi_os_wait_events_complete
symbol: acpi_evaluate_object
symbol: acpi_bus_get_device
symbol: acpi_install_notify_handler
symbol: acpi_evaluate_integer
symbol: acpi_remove_notify_handler

Are you using a different ABI Checker script? Version 1.2 does not seem to be happy about nvidia kernel modules here.

Best Regards
Jarek
--
Jaroslaw Polok
CERN - IT/OIS/ODS
http://home.cern.ch/~jpolok
tel +41 22 767 1834
+41 78 792 0795
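The warning the checker prints can be pictured as a set difference: the undefined symbols a module imports, minus the symbols on the kABI whitelist. The sketch below is a hypothetical illustration of that idea only, not the Red Hat abi_check.py script. It assumes the module's imports come from `nm -u nvidia.ko`-style output (lines of the form `U <symbol>`) and that the whitelist has already been read into a set of names; the sample whitelist is made up and heavily truncated.

```python
# Hypothetical sketch: list symbols a kernel module imports that are not on a
# kABI whitelist. This is NOT the Red Hat abi_check.py script.


def undefined_symbols(nm_output: str) -> set:
    """Collect undefined symbols from `nm -u`-style lines such as
    '                 U acpi_walk_namespace' (format assumed)."""
    symbols = set()
    for line in nm_output.splitlines():
        fields = line.split()
        # An undefined-symbol line is just the type letter 'U' and the name.
        if len(fields) == 2 and fields[0] == "U":
            symbols.add(fields[1])
    return symbols


def unlisted_symbols(used: set, whitelist: set) -> list:
    """Return, sorted, every used symbol that is missing from the whitelist."""
    return sorted(used - whitelist)


if __name__ == "__main__":
    sample = (
        "                 U acpi_walk_namespace\n"
        "                 U agp_bridges\n"
        "                 U printk\n"
    )
    whitelist = {"printk"}  # hypothetical, heavily truncated whitelist
    for sym in unlisted_symbols(undefined_symbols(sample), whitelist):
        print("symbol:", sym)
```

Anything the module needs beyond the whitelist, such as the ACPI and AGP symbols above, is what makes a checker of this kind complain, even if the module loads fine on the kernels it was built for.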
Re: [OT] Re: xorg-x11-fonts-ISO8859-1-75dpi breaks when using rpm
On 4/27/2010 9:36, Tim Edwards wrote:
On 27/04/10 16:16, Faye Gibbins wrote:
Yes, but we use the mdp devolved layer and I've asked if their repos are yum-enabled and they say no. So unless my LM says I can create a yum archive, I'm not sure what else I can do.

Can you get them to agree to at least temporarily let you use yum against the official Scientific Linux repos on the web? If not, you're out of luck, unfortunately. We used to have no access to yum repositories from our DMZ machines and it was very painful getting something installed with just rpm. One tip, though, assuming you have a machine with a working yum (your desktop, maybe?), is to do a 'yum whatprovides mkfontdir', where mkfontdir is what rpm complains is missing. That way you can see exactly which RPM is needed.

Is the problem that you can't use yum at all, or that you just can't use yum for a specific set of RPMs? If it's the second, you could always try 'yum localinstall' so yum can sort out the dependencies that *are* in SL's repos.
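Following the 'yum whatprovides' tip, the missing names can be pulled out of rpm's "Failed dependencies" error and turned into queries to run on a machine that has a working yum. This is a minimal sketch, assuming rpm reports each missing dependency on its own line as `<capability> is needed by <package>`; the package version in the sample is made up for illustration.

```python
import re
import shlex

# Assumed rpm error line format: "\t<capability> is needed by <package>"
NEEDED = re.compile(r"^\s*(?P<cap>.+?)\s+is needed by\s+(?P<pkg>\S+)\s*$")


def missing_capabilities(rpm_error: str) -> list:
    """Return the distinct missing capability names, in order of appearance."""
    caps = []
    for line in rpm_error.splitlines():
        m = NEEDED.match(line)
        if m:
            name = m.group("cap").split()[0]  # drop any version constraint
            if name not in caps:
                caps.append(name)
    return caps


def whatprovides_commands(caps: list) -> list:
    """Build one 'yum whatprovides' command (as an argv list) per capability."""
    return [["yum", "whatprovides", cap] for cap in caps]


if __name__ == "__main__":
    sample = (
        "error: Failed dependencies:\n"
        # package version below is hypothetical
        "\tmkfontdir is needed by xorg-x11-fonts-ISO8859-1-75dpi-7.1-2.1.el5.noarch\n"
    )
    for cmd in whatprovides_commands(missing_capabilities(sample)):
        print(shlex.join(cmd))
```

Each printed line is a command to paste into the yum-capable machine. Once the providing packages are known, 'yum localinstall' on the downloaded RPMs lets yum resolve whatever else it can from SL's repos.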
Re: Nvidia woes...
On 28 April 2010 14:10, Alan Bartlett a...@elrepo.org wrote: On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote: Hey everyone, I currently have deployed a number of SL 5.2 i386 machines. Mark, Further to my earlier message, I have re-read your initial sentence above. A SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5 series. Unfortunately the kmod-nvidia[-*] packages that are available from ELRepo, although being kABI tracking, will only weak-link back to the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many of the other packages, back to the original 2.6.18-8.x.y.el5 kernels. So having raised your hopes, I now have to dash them. Sorry. Regards, Alan.
Re: Nvidia woes...
On Wed, 28 Apr 2010, Alan Bartlett wrote:
On 28 April 2010 14:10, Alan Bartlett a...@elrepo.org wrote:
On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:
Hey everyone, I currently have deployed a number of SL 5.2 i386 machines.

Mark, Further to my earlier message, I have re-read your initial sentence above. A SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5 series.

It is possible to update any SL 5.x system to the latest kernel that is available, so you should be able to get around that. For example, I am running kernels from the 5.4 series on 5.3 systems all the time, and did 5.3 kernels on 5.2 also.

Steve

Unfortunately the kmod-nvidia[-*] packages that are available from ELRepo, although being kABI tracking, will only weak-link back to the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many of the other packages, back to the original 2.6.18-8.x.y.el5 kernels. So having raised your hopes, I now have to dash them. Sorry. Regards, Alan.

--
Steven C. Timm, Ph.D (630) 840-8525
t...@fnal.gov http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities, Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
Re: Nvidia woes...
On 28 April 2010 15:41, Jaroslaw Polok jaroslaw.po...@cern.ch wrote:
We do use some NVIDIA cards, but mostly with the 96xx legacy series driver (however, we may start using the 195 current series on our future hardware). So far we use nvidia packages we maintain ourselves, but it would be interesting for us to change this situation...

Hello Jarek,

As long as you are using SL 5.3 or above, you should find that the packages will fulfil your requirements. The current (as distinct from the legacy) package is regularly rebuilt when nVidia releases a new version.

Speaking of which, I'm a little bit confused about nvidia kernel modules and kABI: checking http://dup.et.redhat.com/ and using the kABI testing script gives the following result on your kmod-nvidia-190.53-1.el5.elrepo.x86_64.rpm, nvidia.ko:

./abi_check.py ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
Red Hat Enterprise Linux 5 ABI Checker -- ABI Checker version: 1.2
Module: ./lib/modules/2.6.18-164.el5/extra/nvidia/nvidia.ko
Kernel: 2.6.18-194.el5
Whitelist: /usr/src/kernels/2.6.18-194.el5-x86_64/kabi_whitelist
WARNING: The following symbols are used by your module
WARNING: and are not on the ABI whitelist.
symbol: acpi_walk_namespace
symbol: agp_bridges
symbol: acpi_get_handle
symbol: acpi_os_wait_events_complete
symbol: acpi_evaluate_object
symbol: acpi_bus_get_device
symbol: acpi_install_notify_handler
symbol: acpi_evaluate_integer
symbol: acpi_remove_notify_handler

That is a known issue and, as I am not one of the team maintaining the nVidia packages, it would be best if I do not go into great detail but just refer you to the relevant bug tracker entries [1][2]. The whole issue of the kernel ABI whitelist, and of certain kmod packages requiring non-listed symbols, is something that we have discussed with Jon Masters, of Red Hat.
Although I know that other members of the ELRepo Admin Team are subscribed to this list, I think it might be best to transfer this discussion to the ELRepo mailing list [3] where the more appropriate audience can be found.

Regards,
Alan.

[1] http://elrepo.org/bugs/view.php?id=30
[2] https://bugzilla.redhat.com/show_bug.cgi?id=520891
[3] http://lists.elrepo.org/mailman/listinfo/elrepo
Re: Memory footprint on 64bit SL vs. 32bit
On Apr 27, 2010, at 00:15, Brett Viren wrote:
We recently started running our C++ analysis code on 64-bit SL 5.3 and have been surprised to find that memory usage is about 2x what we are used to when running it on 32 bits. Comparing a few basic applications like sleep(1) shows similar memory usage. Others, like sshd, show only a 30% size increase (maybe that is subject to configuration differences between the two hosts).

I understand that pointers must double in size, but the bulk of our objects are made of ints and floats, and these are 32/64-bit invariant. I found[1] that poorly defined structs containing pointers can bloat even on non-pointer data members due to the padding needed to keep everything properly aligned. It would kind of surprise me if this is what is behind what we see.

Does anyone have experience in understanding, or maybe even combating, this increase in a program's memory footprint when going to 64 bits?

Is it real or virtual memory usage that's increasing beyond expectations?

Example: glibc's locale handling code behaves quite differently in the 64-bit case. In 32-bit mode, even virtual address space is a scarce resource, while in 64-bit mode it isn't. So in the latter case, they simply mmap the whole file providing the info for the locale in use, while in the former they use a small address window that they slide to the appropriate position. The 64-bit case is simpler and thus probably less code, more robust and easier to maintain. And it's probably faster. The 32-bit case uses less *virtual* memory, but *real* memory usage is about the same, since only those pages actually read will ever be paged in.

This has a dramatic effect on the VSZ of hello world in Python. It has no such effect on anything that really matters; in particular, checking the memory footprints of sleep & co. is not very useful, because they're really small compared to typical HEP analysis apps anyway. What are your actual figures?

Thanks,
-Brett.
[1] http://www.codeproject.com/KB/winsdk/Optimization_64_bit.aspx#IDAJLKNC

--
Stephan Wiesand
DESY -DV-
Platanenallee 6
15738 Zeuthen, Germany
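Brett's padding point is easy to check from Python with `ctypes`, which lays out structures using the platform C compiler's alignment rules. The two structs below are hypothetical and differ only in member order. On a typical 64-bit (LP64) build the interleaved layout needs 8 bytes of padding; on 32-bit i386 both come out the same size.

```python
import ctypes


class Interleaved(ctypes.Structure):
    # int, pointer, int: on LP64 the pointer needs 8-byte alignment, so
    # 4 bytes of padding follow 'a' and 4 more trail 'b' (total 24 bytes).
    _fields_ = [("a", ctypes.c_int), ("p", ctypes.c_void_p), ("b", ctypes.c_int)]


class Grouped(ctypes.Structure):
    # Same members, pointer first and the two ints packed together (16 bytes on LP64).
    _fields_ = [("p", ctypes.c_void_p), ("a", ctypes.c_int), ("b", ctypes.c_int)]


if __name__ == "__main__":
    print(f"pointer size: {ctypes.sizeof(ctypes.c_void_p)} bytes")
    print(f"Interleaved:  {ctypes.sizeof(Interleaved)} bytes")
    print(f"Grouped:      {ctypes.sizeof(Grouped)} bytes")
```

On 64-bit this reports 24 and 16 bytes; on 32-bit both are 12. Ordering members from largest to smallest alignment is the usual way to recover the space. Structures made only of ints and floats gain no padding from the move to 64 bits.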
Problem with latest pam_krb5
A number of my computers upgraded themselves to pam_krb5-2.2.14-15, and remote logins promptly broke. If you tried to ssh in, it would ask for your password and then close the connection. The log file shows an enigmatic message (the 'x' is the username):

account checks fail for 'x': unknown reason -1765328254 (Cannot read password)

If I put back pam_krb5-2.2.14-10, it works fine. To make it even stranger, only some of the systems that did this update have this problem. Any ideas?

Steve Gaarder
System Administrator, Dept of Mathematics
Cornell University, Ithaca, NY, USA
gaar...@math.cornell.edu
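The number in that log line looks like a com_err error code. In that scheme the upper 24 bits encode the name of an error table (up to four characters, 6 bits each) and the low 8 bits give an offset into that table. The small decoder below was written for illustration and follows the com_err packing as I understand it. It shows that -1765328254 is entry 130 of the `krb5` table, which is the "Cannot read password" message the log already prints. It does not explain why pam_krb5-2.2.14-15 hits that error.

```python
# com_err character set: each table-name character is stored as 1 + its index.
CHARSET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_"
BITS_PER_CHAR = 6
ERRCODE_RANGE = 8  # low 8 bits hold the offset within the table


def table_base(name: str) -> int:
    """Signed 32-bit base code for a com_err table name (up to 4 chars)."""
    num = 0
    for ch in name:
        num = (num << BITS_PER_CHAR) | (CHARSET.index(ch) + 1)
    base = (num << ERRCODE_RANGE) & 0xFFFFFFFF
    # Reinterpret as a signed 32-bit value, which is how such codes show up in logs.
    return base - (1 << 32) if base & 0x80000000 else base


def decode(code: int) -> tuple:
    """Split a com_err code into (table name, offset within the table)."""
    value = code & 0xFFFFFFFF
    offset = value & ((1 << ERRCODE_RANGE) - 1)
    num = value >> ERRCODE_RANGE
    chars = []
    for shift in (18, 12, 6, 0):
        v = (num >> shift) & ((1 << BITS_PER_CHAR) - 1)
        if v:  # zero means "no character in this slot"
            chars.append(CHARSET[v - 1])
    return "".join(chars), offset


if __name__ == "__main__":
    name, offset = decode(-1765328254)
    print(name, offset, table_base(name))  # prints: krb5 130 -1765328384
```

So the code is the krb5 table base (-1765328384) plus 130, which makes it a client-side library error, not an unknown one.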
Re: Memory footprint on 64bit SL vs. 32bit
Thanks Stephan and Peter,

Peter Elmer peter.el...@cern.ch writes:
We are actually preparing some proposals/recommendations about measuring memory use, as in addition to this VSIZE/64-bit confusion, the introduction of multicore applications which share memory also misleads people...

This is interesting. I didn't know about the nuances you two bring up. Peter, can you send a link whenever your document is available?

Stephan, we have been looking at /proc/PID/status's VmSize and VIRT from top, which I think are the same. For our Gaudi/Geant4/ROOT/Python-based job on 64 bits we see a size of about 1GB after initial loading, including the Geant4 data sets and the geometry. This then grows to a plateau of about 1.5GB as we encounter rarer and rarer upward fluctuations in event size (our Boost pools-based memory manager only grows as needed, never shrinks). On 32 bits I'm used to seeing about 50% of these numbers.

I'll look into the suggestions you both gave.

Thanks,
-Brett.
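Since VmSize and top's VIRT both count virtual address space, it can help to put resident memory next to them. This minimal sketch parses /proc/<pid>/status and reports VmSize and VmPeak (virtual) alongside VmRSS and VmHWM (resident), all in kB as the kernel prints them. The helper names are hypothetical. If VmSize roughly doubles on 64 bits while VmRSS stays close to the 32-bit figure, the growth is mostly address space.

```python
from pathlib import Path

FIELDS = ("VmPeak", "VmSize", "VmHWM", "VmRSS")


def read_vm_fields(status_text: str, fields=FIELDS) -> dict:
    """Parse 'Name:   <value> kB' lines from /proc/<pid>/status into ints (kB)."""
    result = {}
    for line in status_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in fields:
            parts = rest.split()
            if parts and parts[0].isdigit():
                result[key] = int(parts[0])
    return result


def vm_summary(pid="self") -> dict:
    """Read the memory fields of the given process (Linux only)."""
    return read_vm_fields(Path(f"/proc/{pid}/status").read_text())


if __name__ == "__main__":
    if Path("/proc/self/status").exists():
        for key, kb in sorted(vm_summary().items()):
            print(f"{key:7s} {kb / 1024:8.1f} MB")
```

Passing the PID of the analysis job, for example `vm_summary(12345)` with a hypothetical PID, on both the 32-bit and 64-bit hosts gives the virtual and resident figures side by side.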
Re: Nvidia woes...
Alan Bartlett wrote:
On 28 April 2010 14:10, Alan Bartlett a...@elrepo.org wrote:
On 27 April 2010 21:54, Mark Stodola stod...@pelletron.com wrote:
Hey everyone, I currently have deployed a number of SL 5.2 i386 machines.

Mark, Further to my earlier message, I have re-read your initial sentence above. A SL 5.2 system will be using a kernel from the 2.6.18-92.x.y.el5 series. Unfortunately the kmod-nvidia[-*] packages that are available from ELRepo, although being kABI tracking, will only weak-link back to the 2.6.18-128.x.y.el5 kernel series (i.e. SL 5.3) and not, like many of the other packages, back to the original 2.6.18-8.x.y.el5 kernels. So having raised your hopes, I now have to dash them. Sorry. Regards, Alan.

Correct for the current driver (kmod-nvidia) and the 173 series (kmod-nvidia-173xx) legacy driver, but the older kmod-nvidia-96xx legacy driver is currently kABI compliant with all current EL5 kernels :)
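Taken together, the statements in this thread reduce to a check on the EL5 kernel release number. The table below is a hypothetical summary of the thread only: minimum release 128 (SL 5.3) for kmod-nvidia and kmod-nvidia-173xx, and the original release 8 for kmod-nvidia-96xx. It is not how ELRepo's packages or the weak-modules mechanism actually decide, and support may change with newer builds.

```python
import re

# Minimum EL5 kernel release ("2.6.18-<release>...") per package, as stated in
# this thread. Hypothetical summary, not ELRepo metadata.
MIN_RELEASE = {
    "kmod-nvidia": 128,        # SL 5.3 kernels and later
    "kmod-nvidia-173xx": 128,  # SL 5.3 kernels and later
    "kmod-nvidia-96xx": 8,     # back to the original EL5 kernels
}

EL5_KERNEL = re.compile(r"^2\.6\.18-(\d+)(?:\.\d+)*\.el5")


def el5_release(kernel: str) -> int:
    """Return the release number from a string like '2.6.18-92.1.6.el5'."""
    m = EL5_KERNEL.match(kernel)
    if not m:
        raise ValueError(f"not an EL5 kernel release: {kernel!r}")
    return int(m.group(1))


def usable_packages(kernel: str) -> list:
    """Packages whose stated minimum release this kernel meets, sorted by name."""
    release = el5_release(kernel)
    return sorted(pkg for pkg, minimum in MIN_RELEASE.items() if release >= minimum)


if __name__ == "__main__":
    for k in ("2.6.18-92.1.6.el5", "2.6.18-164.el5"):
        print(k, usable_packages(k))
```

For the stock SL 5.2 kernel this leaves only kmod-nvidia-96xx. Updating just the kernel, as suggested earlier in the thread, moves a 5.2 box into the range covered by all three packages.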