Re: Network is ok, web browser won't browse

2013-04-25 Thread Andras Horvath
"traceroute -I domain.com" as root might help track down the problem.


On Thu, 25 Apr 2013 21:03:41 -0500
Nathan Moore  wrote:

> Context: small academic cluster of SL5.8 machines, x86_64.
> 
> Recently, some of the machines on the cluster stopped being able to
> access the outside internet.  Machines can still update via ftp, yum,
> nslookup, and ping, but http requests beyond the local cluster seem
> to be universally denied.  I've never encountered a problem like this
> before (and it is probably my own fault, as I am a self-taught
> sys-admin), but I'm wondering if anyone else has seen a similar
> problem.
> 
> Bonus points if you can tell me where in /etc  or /var I should look
> to find the offending config file.
> 
> regards,
> 
> NTM


Re: Network is ok, web browser won't browse

2013-04-25 Thread John Lauro
One possibility is a proxy issue: a proxy is configured when it shouldn't be,
the proxy server is down, or no proxy is configured but an upstream firewall
requires you to go through one.

Less likely, but sometimes a switch or router just acts up and needs to be
reset. If the machines are identically configured and yet groups of them
behave differently, that makes it a little more likely.

Have you tried different utilities for http, such as wget? 
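
A quick way to test the proxy theory is to see which proxy settings a shell on an affected machine inherits, and then retry the request with wget. The sketch below is only an illustration: the helper name show_proxy_settings is made up here, example.com is a placeholder, and it assumes any proxy is set through the conventional environment variables rather than in a browser profile or /etc/wgetrc.

```shell
# Hypothetical helper (not from the thread): list proxy-related
# environment variables, or say that none are set.
show_proxy_settings() {
    found=$(env | grep -i -E '^(http|https|ftp|no)_proxy=' || true)
    if [ -n "$found" ]; then
        printf '%s\n' "$found"
    else
        echo "no proxy variables set"
    fi
}

show_proxy_settings

# Then compare a fetch with and without the proxy:
#   wget -O /dev/null http://example.com/
#   wget --no-proxy -O /dev/null http://example.com/
```

If the second wget works and the first does not, the proxy setting is the likely culprit; if both fail, look further upstream.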


- Original Message -

From: "Nathan Moore"  
To: "scientific-linux-users"  
Sent: Thursday, April 25, 2013 10:03:41 PM 
Subject: Network is ok, web browser won't browse 

Context: small academic cluster of SL5.8 machines, x86_64. 


Recently, some of the machines on the cluster stopped being able to access the 
outside internet. Machines can still update via ftp, yum, nslookup, and ping, 
but http requests beyond the local cluster seem to be universally denied. I've 
never encountered a problem like this before (and it is probably my own fault, 
as I am a self-taught sys-admin), but I'm wondering if anyone else has seen a 
similar problem. 


Bonus points if you can tell me where in /etc or /var I should look to find the 
offending config file. 


regards, 


NTM 





Re: Network is ok, web browser won't browse

2013-04-25 Thread zxq9

On 04/26/2013 11:03 AM, Nathan Moore wrote:

> Context: small academic cluster of SL5.8 machines, x86_64.
>
> Recently, some of the machines on the cluster stopped being able to
> access the outside internet.  Machines can still update via ftp, yum,
> nslookup, and ping, but http requests beyond the local cluster seem to
> be universally denied.  I've never encountered a problem like this
> before (and it is probably my own fault, as I am a self-taught
> sys-admin), but I'm wondering if anyone else has seen a similar problem.
>
> Bonus points if you can tell me where in /etc  or /var I should look to
> find the offending config file.


Sounds more like a firewalling, port-specific routing or corporate 
netnanny problem if they can send/receive other forms of traffic.


Network is ok, web browser won't browse

2013-04-25 Thread Nathan Moore
Context: small academic cluster of SL5.8 machines, x86_64.

Recently, some of the machines on the cluster stopped being able to access
the outside internet.  Machines can still update via ftp, yum, nslookup,
and ping, but http requests beyond the local cluster seem to be universally
denied.  I've never encountered a problem like this before (and it is
probably my own fault, as I am a self-taught sys-admin), but I'm wondering
if anyone else has seen a similar problem.

Bonus points if you can tell me where in /etc  or /var I should look to
find the offending config file.

regards,

NTM


Re: Help finding a hardware problem (I think) now fixed maybe

2013-04-25 Thread Joseph Areeda

On 04/25/2013 02:58 PM, Konstantin Olchanski wrote:

>> joe@george:~$ sudo smartctl -iA /dev/sdd
>> Model Family:     SandForce Driven SSDs
>> Device Model:     Corsair CSSD-F240GB2
>> Firmware Version: 2.0
>> User Capacity:    240,057,409,536 bytes [240 GB]
>>
>> 171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       1
>> 174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       45
>> 181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       1
>
> Your SSD is toast... Program_Fail_Count non zero is bad. Unexpected power loss
> count non zero is bad.

I agree, it's now two days with Einstein at Home running using 100% CPU 
and GPU with no crashes.


The symptoms were certainly strange.  To recap:

On average once or twice a day the system would just stop working,
occasionally with error messages but most of the time with a complete
freeze.  A power cycle seemed to reset the clock and let me work for
another 4 to 8 hours.


Error messages included "too many files open", "i/o error",  "bus error" 
and probably more.


The SSD was my system disk but swap, /tmp, /usr were on spinning media.  
When it worked it did everything I tried, copy files and diff,  read all 
files, run SMART diagnostic tests.


The SSD was only marginally faster (judging by response time).  I don't
think I'm going to replace it with another SSD.  However, I have nothing
against SSDs: I had almost 2 years of trouble-free operation, and a sample
size of one does not lead me to any conclusions about them.


Thanks to the list, I now have more Linux diagnostic tools.

Joe


Re: IO performance regressions with kernel-2.6.32-358

2013-04-25 Thread Orion Poplawski

On 04/25/2013 05:32 PM, Bill Maidment wrote:

Hi again.
Just a quick note to say I've tried upgrading to the latest 
kernel-2.6.32-358.6.1.el6.x86_64 just in case the problem had been 
"accidentally" fixed.
However, swap is still used in preference to RAM during backups.
I've changed yum.conf to installonly_limit=5 so I don't lose my working 
kernel-2.6.32-279.22.1.el6.x86_64 on the next two kernel updates.
Anyone have any news on bug 949166 yet?


I'm afraid not.  Thanks for checking the latest kernel.


--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA/CoRA Division                    FAX: 303-415-9702
3380 Mitchell Lane                  or...@cora.nwra.com
Boulder, CO 80301              http://www.cora.nwra.com


RE: IO performance regressions with kernel-2.6.32-358

2013-04-25 Thread Bill Maidment
Hi again.
Just a quick note to say I've tried upgrading to the latest 
kernel-2.6.32-358.6.1.el6.x86_64 just in case the problem had been 
"accidentally" fixed.
However, swap is still used in preference to RAM during backups.
I've changed yum.conf to installonly_limit=5 so I don't lose my working 
kernel-2.6.32-279.22.1.el6.x86_64 on the next two kernel updates.
Anyone have any news on bug 949166 yet?
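
For anyone who wants to confirm how many kernels yum will keep before running the next update, here is a minimal sketch. The function name installonly_limit is hypothetical (chosen to match the option), and it assumes the option appears as a plain key=value line in the file.

```shell
# Hypothetical helper: print the installonly_limit value from a yum
# configuration file (defaults to /etc/yum.conf). If the option appears
# more than once, the last occurrence is printed.
installonly_limit() {
    conf=${1:-/etc/yum.conf}
    sed -n 's/^[[:space:]]*installonly_limit[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" | tail -n 1
}

# Usage:
#   installonly_limit      # how many kernels yum will keep
#   rpm -q kernel          # which kernels are installed right now
```

No output means the option is not set in that file, in which case yum falls back to its built-in default.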


Regards
Bill Maidment
Maidment Enterprises Pty Ltd

N.B. Any email disclaimer in the original message has been carefully ignored as 
it has no meaning in law.
 
 
-Original message-
> From: Orion Poplawski <or...@cora.nwra.com>
> Sent: Sunday 14th April 2013 3:23
> To: Olivier Mauras
> Cc: scientific-linux-users@fnal.gov
> Subject: Re: IO performance regressions with kernel-2.6.32-358
> 
> On 04/09/2013 12:33 AM, Olivier Mauras wrote:
> > Hello,
> >
> > I can't access the bug, what is the status ?
> 
> Interesting, wonder why they took it private.  Perhaps that is a good 
> sign :).  Unfortunately no comments yet.
> 
> > On 2013-04-06 16:44, Orion Poplawski wrote:
> >
> >> On 04/05/2013 07:32 PM, Bill Maidment wrote:
> >>> After switching back to the older kernel, I can confirm that the swap
> >>> usage has returned to normal, even during the backup time 6am-8am.
> >>> (See attached graph) I'm wondering if this has anything to do with
> >>> NUMA, which sometimes prefers to use swap rather than memory that is
> >>> slightly slower than the currently used memory. I am using a single
> >>> Phenom II 955 processor (quad core) on an ASUS M4N68T motherboard and
> >>> SL6.4 upgraded from 6.3
> >> Switching back fixed it for me too.  Max swap usage reduced from 100% to
> >> 25%, all dumps completed on time.  I suppose NUMA could be a factor,
> >> this is a Xeon L5520 system.  I haven't seen this large swap usage on my
> >> other older hardware, although I only have the one machine writing
> >> backups to local disk, remote machines send the backup via the network
> >> to this host.
> >>
> >> I've filed https://bugzilla.redhat.com/show_bug.cgi?id=949166
> >>
> 
> 
> -- 
> Orion Poplawski
> Technical Manager                     303-415-9701 x222
> NWRA/CoRA Division                    FAX: 303-415-9702
> 3380 Mitchell Lane                  or...@cora.nwra.com
> Boulder, CO 80301              http://www.cora.nwra.com



Re: Help finding a hardware problem (I think) now fixed maybe

2013-04-25 Thread Konstantin Olchanski
> joe@george:~$ sudo smartctl -iA /dev/sdd
> Model Family:     SandForce Driven SSDs
> Device Model:     Corsair CSSD-F240GB2
> Firmware Version: 2.0
> User Capacity:    240,057,409,536 bytes [240 GB]
>
> 171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       1
> 174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       45
> 181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       1

Your SSD is toast... Program_Fail_Count non zero is bad. Unexpected power loss 
count non zero is bad.
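
The same check can be scripted so the attributes mentioned above stand out from the rest of the table. This is a rough sketch, not a diagnosis tool: the function name flag_nonzero_smart and the list of attribute names are choices made for this illustration, and it assumes each attribute sits on one line with RAW_VALUE in the tenth column, as smartctl -A prints it.

```shell
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# attributes named below whose RAW_VALUE (column 10) is not zero.
flag_nonzero_smart() {
    awk '$2 ~ /^(Program_Fail_Count|Erase_Fail_Count|Unexpect_Power_Loss_Ct|Reported_Uncorrect)$/ && $10 + 0 != 0 {
        print $2 " raw=" $10
    }'
}

# Usage (as root; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | flag_nonzero_smart
```

Attribute names and their meaning vary by vendor, so the list would need adjusting for drives that are not SandForce based.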

-- 
Konstantin Olchanski
Data Acquisition Systems: The Bytes Must Flow!
Email: olchansk-at-triumf-dot-ca
Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada


kernel 2.6.32-358.6.1.el6 updates removes openafs

2013-04-25 Thread Joseph Thomas Szep
Hi,

We tried to install the latest kernel update, kernel-2.6.32-358.6.1.el6, and due
to an odd chain of dependencies, all openafs packages were removed.

On our systems, we only keep 2 kernels (as opposed to the default of 3).  So we
have kernels:

# rpm -q kernel
kernel-2.6.32-279.19.1.el6.x86_64
kernel-2.6.32-358.2.1.el6.x86_64

and openafs packages:

openafs.x86_64                 1.6.2-0.144.sl6            @sl/6.2
openafs-client.x86_64          1.6.2-0.144.sl6            @sl/6.2
openafs-krb5.x86_64            1.6.2-0.144.sl6            @sl/6.2
openafs-module-tools.x86_64    1.6.2-0.144.sl6            @sl-security/6.2
kmod-openafs.noarch            1.6.2-4.SL64.el6           @sl-security/6.2
kmod-openafs-279.x86_64        1.6.2-0.144.sl6.279        @sl-security/6.2
kmod-openafs-358.x86_64        1.6.2-0.144.sl6.358.0.1    @sl-security/6.2

When the kernel-2.6.32-358.6.1 packages install, the "279" kernel is removed
(due to installonly_limit=2 in our yum.conf), and that triggers the removal of
kmod-openafs-279.  That seems to trigger the removal of kmod-openafs, and THAT
triggers the removal of kmod-openafs-358.

All this seems to trigger the removal of openafs-client, and that leads to local
afs-dependent packages being removed (usrlocalITbin).  The output from a yum
upgrade follows:

# yum update kernel
Loaded plugins: aliases, priorities, product-id, protectbase, refresh-packagekit, subscription-manager
Updating Red Hat repositories.
1427 packages excluded due to repository priority protections
0 packages excluded due to repository protections
Setting up Update Process
Resolving Dependencies
--> Running transaction check
---> Package kernel.x86_64 0:2.6.32-358.6.1.el6 will be installed
--> Processing Dependency: kernel-firmware >= 2.6.32-358.6.1.el6 for package: kernel-2.6.32-358.6.1.el6.x86_64
--> Running transaction check
---> Package kernel-firmware.noarch 0:2.6.32-358.2.1.el6 will be updated
---> Package kernel-firmware.noarch 0:2.6.32-358.6.1.el6 will be an update
--> Finished Dependency Resolution
--> Running transaction check
---> Package kernel.x86_64 0:2.6.32-279.19.1.el6 will be erased
--> Processing Dependency: kernel(do_settimeofday) = 0x5603cf43 for package: kmod-openafs-279-1.6.2-0.144.sl6.279.x86_64
--> Running transaction check
---> Package kmod-openafs-279.x86_64 0:1.6.2-0.144.sl6.279 will be erased
--> Processing Dependency: kmod-openafs-279 for package: kmod-openafs-1.6.2-4.SL64.el6.noarch
--> Running transaction check
---> Package kmod-openafs.noarch 0:1.6.2-4.SL64.el6 will be erased
--> Processing Dependency: openafs-kernel >= 1.6 for package: openafs-client-1.6.2-0.144.sl6.x86_64
--> Running transaction check
---> Package openafs-client.x86_64 0:1.6.2-0.144.sl6 will be erased
--> Processing Dependency: openafs-client >= 1.6 for package: kmod-openafs-358-1.6.2-0.144.sl6.358.0.1.x86_64
--> Processing Dependency: openafs-client for package: usrlocalITbin-6.0-el6.bucs.1.noarch
--> Running transaction check
---> Package kmod-openafs-358.x86_64 0:1.6.2-0.144.sl6.358.0.1 will be erased
---> Package usrlocalITbin.noarch 0:6.0-el6.bucs.1 will be erased
--> Finished Dependency Resolution

Dependencies Resolved


 Package            Arch     Version                    Repository          Size

Installing:
 kernel             x86_64   2.6.32-358.6.1.el6         sl-security         26 M
Removing:
 kernel             x86_64   2.6.32-279.19.1.el6        @sl-security/6.2   113 M
Updating for dependencies:
 kernel-firmware    noarch   2.6.32-358.6.1.el6         sl-security         11 M
Removing for dependencies:
 kmod-openafs       noarch   1.6.2-4.SL64.el6           @sl-security/6.2   0.0
 kmod-openafs-279   x86_64   1.6.2-0.144.sl6.279        @sl-security/6.2   1.3 M
 kmod-openafs-358   x86_64   1.6.2-0.144.sl6.358.0.1    @sl-security/6.2   1.3 M
 openafs-client     x86_64   1.6.2-0.144.sl6            @sl/6.2            2.4 M
 usrlocalITbin      noarch   6.0-el6.bucs.1             @cs/6.1            391

Transaction Summary
==

Re: Help finding a hardware problem (I think) now fixed maybe

2013-04-25 Thread Akemi Yagi
On Thu, Apr 25, 2013 at 1:02 PM, Joseph Areeda  wrote:

> Akemi,
> Here you go, please reply with your insight into what it means.
>
> Joe

As I said, I'm not an expert on SSD. :-)

> === START OF READ SMART DATA SECTION ===
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

> 231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0

This one looks "interesting". Someone noted [1] that "A Corsair blog
post seems to show that the value never goes below 10% so I presume it
should be replaced at 10%."

Akemi

[1] 
http://serverfault.com/questions/316150/how-to-determine-number-of-write-cycles-or-expected-life-for-ssd-under-linux


Re: Help finding a hardware problem (I think) now fixed maybe

2013-04-25 Thread Yasha Karant

On 04/25/2013 01:02 PM, Joseph Areeda wrote:


On 04/25/2013 12:50 PM, Akemi Yagi wrote:

On Thu, Apr 25, 2013 at 12:35 PM, Joseph Areeda  wrote:

On 04/25/2013 11:59 AM, Todd And Margo Chester wrote:

Just out of curiosity, what was the make and model
of the bad SSD?

Disk Utility reports Corsair CSSD-F240GB2. I haven't yet copied everything
off it and removed it.

Joe

While you still have that SSD, could you show us the output returned by:

smartctl -iA /dev/sdX

(Please replace "X" appropriately)

I'm not an expert on SSD but just wonder what information can be
obtained on the use of your SSD.

Akemi

Akemi,
Here you go, please reply with your insight into what it means.

Joe

joe@george:~$ sudo smartctl -iA /dev/sdd
smartctl 5.43 2012-06-30 r3573
[x86_64-linux-2.6.32-358.6.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen,
http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     Corsair CSSD-F240GB2
Serial Number:    1117650209970055
LU WWN Device Id: 5 00 00085
Firmware Version: 2.0
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 6
Local Time is:    Thu Apr 25 12:59:49 2013 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   110   100   050    Pre-fail  Always       -       0/31829019
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       15095h+36m+58.580s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       139
171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       1
172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       45
177 Wear_Range_Delta        0x       000   000   000    Old_age   Offline      -       0
181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       1
182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   029   051   000    Old_age   Always       -       29 (Min/Max 18/51)
195 ECC_Uncorr_Error_Count  0x001c   110   100   000    Old_age   Offline      -       0/31829019
196 Reallocated_Event_Count 0x0033   100   100   000    Pre-fail  Always       -       0
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
233 SandForce_Internal      0x       000   000   000    Old_age   Offline      -       1344
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       640
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       640
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       3712





Without in any way promoting or endorsing a particular vendor, look at:

http://www.newegg.com/Product/Product.aspx?Item=N82E16820233126

and the reviews thereunder.

quote:  1 out of 5 eggs

Dead After Two Months

Pros: None

Cons: Unreliable right up until its death, a very expensive leason

Other Thoughts: At the price being charged for this item it should be 
flawless


Manufacturer Response:

We prefer flawless too but, due to the laws of physics, pretty much 
ANY product made with a semiconductors is going to have a minimum of a 
2% failure rate. And, that is why we have our warranty. Please RMA the 
drive and we'll replace it.


External Link(s):
Corsair Contact Page

Not my review.

Yasha Karant


Re: Help finding a hardware problem (I think) now fixed maybe

2013-04-25 Thread Joseph Areeda


On 04/25/2013 12:50 PM, Akemi Yagi wrote:

On Thu, Apr 25, 2013 at 12:35 PM, Joseph Areeda  wrote:

On 04/25/2013 11:59 AM, Todd And Margo Chester wrote:

Just out of curiosity, what was the make and model
of the bad SSD?

Disk Utility reports Corsair CSSD-F240GB2. I haven't yet copied everything
off it and removed it.

Joe

While you still have that SSD, could you show us the output returned by:

smartctl -iA /dev/sdX

(Please replace "X" appropriately)

I'm not an expert on SSD but just wonder what information can be
obtained on the use of your SSD.

Akemi

Akemi,
Here you go, please reply with your insight into what it means.

Joe

   joe@george:~$ sudo smartctl -iA /dev/sdd
   smartctl 5.43 2012-06-30 r3573
   [x86_64-linux-2.6.32-358.6.1.el6.x86_64] (local build)
   Copyright (C) 2002-12 by Bruce Allen,
   http://smartmontools.sourceforge.net

   === START OF INFORMATION SECTION ===
   Model Family:     SandForce Driven SSDs
   Device Model:     Corsair CSSD-F240GB2
   Serial Number:    1117650209970055
   LU WWN Device Id: 5 00 00085
   Firmware Version: 2.0
   User Capacity:    240,057,409,536 bytes [240 GB]
   Sector Size:      512 bytes logical/physical
   Device is:        In smartctl database [for details use: -P show]
   ATA Version is:   8
   ATA Standard is:  ATA-8-ACS revision 6
   Local Time is:    Thu Apr 25 12:59:49 2013 PDT
   SMART support is: Available - device has SMART capability.
   SMART support is: Enabled

   === START OF READ SMART DATA SECTION ===
   SMART Attributes Data Structure revision number: 10
   Vendor Specific SMART Attributes with Thresholds:
   ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
     1 Raw_Read_Error_Rate     0x000f   110   100   050    Pre-fail  Always       -       0/31829019
     5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
     9 Power_On_Hours_and_Msec 0x0032   100   100   000    Old_age   Always       -       15095h+36m+58.580s
    12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       139
   171 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       1
   172 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
   174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       45
   177 Wear_Range_Delta        0x       000   000   000    Old_age   Offline      -       0
   181 Program_Fail_Count      0x0032   000   000   000    Old_age   Always       -       1
   182 Erase_Fail_Count        0x0032   000   000   000    Old_age   Always       -       0
   187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
   194 Temperature_Celsius     0x0022   029   051   000    Old_age   Always       -       29 (Min/Max 18/51)
   195 ECC_Uncorr_Error_Count  0x001c   110   100   000    Old_age   Offline      -       0/31829019
   196 Reallocated_Event_Count 0x0033   100   100   000    Pre-fail  Always       -       0
   231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
   233 SandForce_Internal      0x       000   000   000    Old_age   Offline      -       1344
   234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       640
   241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       640
   242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       3712






Re: Help finding a hardware problem (I think) now fixed maybe

2013-04-25 Thread Akemi Yagi
On Thu, Apr 25, 2013 at 12:35 PM, Joseph Areeda  wrote:
> On 04/25/2013 11:59 AM, Todd And Margo Chester wrote:

>> Just out of curiosity, what was the make and model
>> of the bad SSD?
>
> Disk Utility reports Corsair CSSD-F240GB2. I haven't yet copied everything
> off it and removed it.
>
> Joe

While you still have that SSD, could you show us the output returned by:

smartctl -iA /dev/sdX

(Please replace "X" appropriately)

I'm not an expert on SSD but just wonder what information can be
obtained on the use of your SSD.

Akemi


Re: Help finding a hardware problem (I think) now fixed maybe

2013-04-25 Thread Joseph Areeda

On 04/25/2013 11:59 AM, Todd And Margo Chester wrote:

On 04/24/2013 07:07 PM, Joseph Areeda wrote:

so perhaps I can blame it on the SSD going bad.


Just out of curiosity, what was the make and model
of the bad SSD?

Disk Utility reports Corsair CSSD-F240GB2. I haven't yet copied everything
off it and removed it.


Joe


Re: Help finding a hardware problem (I think) now fixed maybe

2013-04-25 Thread Todd And Margo Chester

On 04/24/2013 07:07 PM, Joseph Areeda wrote:

so perhaps I can blame it on the SSD going bad.


Just out of curiosity, what was the make and model
of the bad SSD?


[SOLVED] Problems with kmod-nvidia and 6.4

2013-04-25 Thread Joseph Areeda
Well, I solved this issue and it's almost too embarrassing to discuss, 
but I learn more by broadcasting my ignorance than trying to hide it.


This system is a multi-boot (SL6 mainly but also Ubuntu and Windows) 
that I use to develop and test cross platform apps.


I prefer Ubuntu's grub-mkconfig because I haven't found anything like it
on SL.  It searches all partitions, mounted or not, for bootable images
and adds them to the grub boot menu.


So what happened was that I started my process with yum update, which
updated the kernel to 2.6.32-358.6.1.el6.x86_64 but didn't update the grub
config, so the system was still booting the old kernel,
2.6.32-358.2.1.el6.x86_64.


This confused not only me but also the nvidia install scripts.  Rerunning
the grub-mkconfig script in Ubuntu allowed me to install kmod-nvidia,
nvidia-x11-drv, and nvidia-x11-drv-32bit without issue.
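
A mismatch like this between the running kernel and the newest installed one is easy to check for before installing kernel modules. The sketch below is one way to do it under stated assumptions: newest_kernel is a made-up helper name, it relies on GNU sort's -V (version sort) option, and the rpm query format in the usage lines is chosen so its output matches the shape of uname -r.

```shell
# Hypothetical helper: read kernel version-release strings on stdin and
# print the highest one, using GNU sort's version ordering.
newest_kernel() {
    sort -V | tail -n 1
}

# Usage on an RPM-based system:
#   newest=$(rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' | newest_kernel)
#   [ "$newest" = "$(uname -r)" ] || echo "running $(uname -r), newest installed is $newest"
```

If the two differ after a reboot, the boot loader configuration is the first place to look.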


A very red-faced apology for the red-herring,

Joe



On 04/25/2013 07:51 AM, Joseph Areeda wrote:

On 04/25/2013 01:03 AM, John Pilkington wrote:

On 25/04/13 03:23, Joseph Areeda wrote:

Here I am again.

I have my system working well with nouveau drivers but I need to install
the nvidia drivers because I do do some GPU work.

I tried installing kmod-nvidia and nvidia-x11-drv from ELREPO but X
won't start.

BTW - if this happens to you it is very helpful to set the boot level to
3 so you can poke around.  Edit /etc/inittab and change the level in the
last line from 5 to 3.

I reread the thread from a few weeks ago on this problem but didn't see
any resolution.

The error messages that seem pertinent are:

 From /var/log/Xorg.0.log.old (I'm now running nouveau drivers)

[45.437] (II) LoadModule: "glx"
[45.449] (II) Loading /usr/lib64/xorg/modules/extensions/nvidia/libglx.so
[46.197] (II) Module glx: vendor="NVIDIA Corporation"
[46.197]    compiled for 4.0.2, module version = 1.0.0
[46.197]    Module class: X.Org Server Extension
[46.197] (II) NVIDIA GLX Module  310.44  Wed Mar 27 15:10:55 PDT 2013
[46.197] Loading extension GLX
[46.197] (II) LoadModule: "nvidia"
[46.206] (II) Loading /usr/lib64/xorg/modules/drivers/nvidia_drv.so
[46.269] (II) Module nvidia: vendor="NVIDIA Corporation"
[46.269]    compiled for 4.0.2, module version = 1.0.0
[46.269]    Module class: X.Org Video Driver
[46.581] (EE) NVIDIA: Failed to load the NVIDIA kernel module. Please check your
[46.581] (EE) NVIDIA: system's kernel log for additional error messages.
[46.581] (II) UnloadModule: "nvidia"
[46.581] (II) Unloading nvidia
[46.581] (EE) Failed to load module "nvidia" (module-specific error, 0)
[46.581] (EE) No drivers available.

from /var/log/messages:

Apr 24 18:53:21 george kernel: nvidia: module license 'NVIDIA' taints kernel.
Apr 24 18:53:21 george kernel: Disabling lock debugging due to kernel taint
Apr 24 18:53:21 george kernel: NVRM: The NVIDIA probe routine was not called for 1 device(s).
Apr 24 18:53:21 george kernel: NVRM: This can occur when a driver such as nouveau, rivafb,
Apr 24 18:53:21 george kernel: NVRM: nvidiafb, or rivatv was loaded and obtained ownership of
Apr 24 18:53:21 george kernel: NVRM: the NVIDIA device(s).
Apr 24 18:53:21 george kernel: NVRM: Try unloading the conflicting kernel module (and/or
Apr 24 18:53:21 george kernel: NVRM: reconfigure your kernel without the conflicting
Apr 24 18:53:21 george kernel: NVRM: driver(s)), then try loading the NVIDIA kernel module
Apr 24 18:53:21 george kernel: NVRM: again.
Apr 24 18:53:21 george kernel: NVRM: No NVIDIA graphics adapter probed!


Checking /etc/modprobe.d (with the nvidia drivers installed) showed
those modules were either in blacklist.conf or blacklist-nouveau.conf.

I also tried

dracut -f /boot/initramfs-$(uname -r).img $(uname -r)

which has worked in the past when blacklist wasn't really blacklisting.

I'm not sure I will be running 6.4 from now on.  It has not been tested
with the rpms from our group.

I'd appreciate any suggestions.

Joe

I've just had what looks like a similar problem on Fedora 17, solved 
by editing the 'files' section of xorg.conf.  Modules sourced by 
X.org were being loaded instead of those from nvidia. Here's an 
extract from the thread (304.64) on rpmfusion-users.



Section "Files"
#ModulePath "/usr/lib64/xorg/modules/extensions/nvidia"
  ModulePath   "/usr/lib64/nvidia/xorg"
#ModulePath   "/usr/lib64/xorg/modules/extensions"
ModulePath   "/usr/lib64/xorg/modules"
FontPath "/usr/share/fonts/default/Type1"
EndSection

as suggested here:

http://forums.fedoraforum.org/showthread.php?t=286084


John P

Thanks for the suggestion, John, but that doesn't seem to be my problem,
at least not exactly.


Reading the thread and following the suggestions did not solve my 
problem.  I bel

Re: Help finding a hardware problem (I think)

2013-04-25 Thread Yasha Karant

On 04/25/2013 07:48 AM, Jeff Siddall wrote:

On 04/25/2013 08:31 AM, Elias Persson wrote:

On 2013-04-24 19:34, Joseph Areeda wrote:

Thanks Jeff,

This does support my current hypothesis that the SSD I was mounting on /
is the most likely culprit.

What fun.

Joe

On 04/24/2013 10:27 AM, Jeff Siddall wrote:

On 04/23/2013 07:20 PM, Konstantin Olchanski wrote:

disk utility show ... SMART [is] fine.

SMART "health report" is useless. I had dead disks report "SMART OK"
and perfectly functional disks report "SMART Failure, replace your
disk now".


Agreed.  SMART doesn't diagnose everything.

On the flaky drive I recently replaced smart extended offline tests
all passed as did the smart health assessment check. Nothing else
wrong either (no pending/offline uncorrectable or CRC errors).  But it
surely was not working well.

Jeff



badblocks might be useful?

http://en.wikipedia.org/wiki/Badblocks

You'd presumably want the "non-destructive" tests...


smartctl -t long is probably a better option.  If a small number of bad
blocks are detected they should be swapped out by the drive itself
meaning they are transparent to the FS.  You won't see any of that with
badblocks.

Jeff


Blocks swapped out by the hardware controller built into the hard drive
(the controller with which the computer's drive interface controller
communicates, e.g., the SATA controller on a motherboard) might or might
not be transparent.  Certain types of information are duplicated
automatically, some by the file system and some on the disk by the
hardware controller.  For information that is not duplicated, suppose
that "chunk M" of a file composed of N chunks is defective (where a chunk
depends on the specifics of the drive and is typically a block) and its
contents are not recoverable.  The hardware controller then substitutes
another chunk held in reserve for chunk M and gives it the same location.
I can explain this mapping algorithm in greater detail if the reader is
not familiar with it.

Thus, the total size of the file is unchanged, but the contents of the
former chunk M are in fact destroyed.  This condition might not be
reported as a bad block, depending upon the internal error detection and
correction methodology used by the file system implementation.  However,
if the information is in fact lost and an application attempts to use
the information in the file, then an error will occur.  In the case of a
non-critical video stream, this could be nothing more than some "black"
pixels in the image that might not be detectable to the casual human
viewer.  In other cases, e.g., a text file, the loss might be obvious.


Yasha Karant


Re: Problems with kmod-nvidia and 6.4

2013-04-25 Thread Joseph Areeda

On 04/25/2013 01:03 AM, John Pilkington wrote:

On 25/04/13 03:23, Joseph Areeda wrote:

Here I am again.

I have my system working well with nouveau drivers but I need to install
the nvidia drivers because I do do some GPU work.

I tried installing kmod-nvidia and nvidia-x11-drv from ELREPO but X
won't start.

BTW - if this happens to you it is very helpful to set the boot level to
3 so you can poke around.  Edit /etc/inittab and change the level in the
last line from 5 to 3.
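
The inittab edit described above can also be scripted. This is a minimal sketch that assumes the SysV-style line "id:5:initdefault:" used on SL6; the function name set_default_runlevel is made up for the example, and the file is a parameter so the change can be tried on a copy first.

```shell
# Hypothetical helper: set the default runlevel in an inittab-style file
# by rewriting its "id:N:initdefault:" line. The file is written through
# a temporary copy so its ownership and permissions stay as they were.
set_default_runlevel() {
    level=$1
    file=${2:-/etc/inittab}
    tmp=$(mktemp) || return 1
    sed "s/^id:[0-6]:initdefault:/id:${level}:initdefault:/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage (as root):
#   cp /etc/inittab /etc/inittab.bak
#   set_default_runlevel 3      # boot to a text console
#   set_default_runlevel 5      # back to the graphical login
```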

I reread the thread from a few weeks ago on this problem but didn't see
any resolution.

The error messages that seem pertinent are:

 From /var/log/Xorg.0.log.old (I'm now running nouveau drivers)

[45.437] (II) LoadModule: "glx"
[45.449] (II) Loading /usr/lib64/xorg/modules/extensions/nvidia/libglx.so
[46.197] (II) Module glx: vendor="NVIDIA Corporation"
[46.197]    compiled for 4.0.2, module version = 1.0.0
[46.197]    Module class: X.Org Server Extension
[46.197] (II) NVIDIA GLX Module  310.44  Wed Mar 27 15:10:55 PDT 2013
[46.197] Loading extension GLX
[46.197] (II) LoadModule: "nvidia"
[46.206] (II) Loading /usr/lib64/xorg/modules/drivers/nvidia_drv.so
[46.269] (II) Module nvidia: vendor="NVIDIA Corporation"
[46.269]    compiled for 4.0.2, module version = 1.0.0
[46.269]    Module class: X.Org Video Driver
[46.581] (EE) NVIDIA: Failed to load the NVIDIA kernel module. Please check your
[46.581] (EE) NVIDIA: system's kernel log for additional error messages.
[46.581] (II) UnloadModule: "nvidia"
[46.581] (II) Unloading nvidia
[46.581] (EE) Failed to load module "nvidia" (module-specific error, 0)
[46.581] (EE) No drivers available.

from /var/log/messages:

Apr 24 18:53:21 george kernel: nvidia: module license 'NVIDIA' taints kernel.
Apr 24 18:53:21 george kernel: Disabling lock debugging due to kernel taint
Apr 24 18:53:21 george kernel: NVRM: The NVIDIA probe routine was not called for 1 device(s).
Apr 24 18:53:21 george kernel: NVRM: This can occur when a driver such as nouveau, rivafb,
Apr 24 18:53:21 george kernel: NVRM: nvidiafb, or rivatv was loaded and obtained ownership of
Apr 24 18:53:21 george kernel: NVRM: the NVIDIA device(s).
Apr 24 18:53:21 george kernel: NVRM: Try unloading the conflicting kernel module (and/or
Apr 24 18:53:21 george kernel: NVRM: reconfigure your kernel without the conflicting
Apr 24 18:53:21 george kernel: NVRM: driver(s)), then try loading the NVIDIA kernel module
Apr 24 18:53:21 george kernel: NVRM: again.
Apr 24 18:53:21 george kernel: NVRM: No NVIDIA graphics adapter probed!


Checking /etc/modprobe.d (with the nvidia drivers installed) showed
those modules were either in blacklist.conf or blacklist-nouveau.conf.

I also tried

dracut -f /boot/initramfs-$(uname -r).img $(uname -r)

which has worked in the past when blacklist wasn't really blacklisting.
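
Since the NVRM messages point at another driver owning the device, one way to narrow this down is to check each place where the blacklist has to take effect. The following is a rough sketch and not a confirmed fix: is_blacklisted and is_loaded are hypothetical helper names, and the rdblacklist=nouveau kernel argument mentioned in the comments is an assumption about what EL6 dracut honours.

```shell
#!/bin/sh
# is_blacklisted MODULE [DIR]
# Succeed if some *.conf file under DIR (default /etc/modprobe.d)
# contains a "blacklist MODULE" line.
is_blacklisted() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    grep -qsE "^[[:space:]]*blacklist[[:space:]]+${module}([[:space:]]|\$)" "$dir"/*.conf
}

# is_loaded MODULE
# Read lsmod-style output on stdin; succeed if MODULE is listed.
is_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Example checks (as root):
#   is_blacklisted nouveau && echo "blacklist entry present"
#   lsmod | is_loaded nouveau && echo "nouveau is loaded anyway"
#   lsinitrd /boot/initramfs-$(uname -r).img | grep nouveau   # still in the initramfs?
#   grep rdblacklist /proc/cmdline    # assumption: rdblacklist=nouveau on the kernel line
```

If nouveau shows up in lsmod despite the blacklist entry, the initramfs or the kernel command line is the next place to look.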

I'm not sure I will be running 6.4 from now on.  It has not been tested
with the rpms from our group.

I'd appreciate any suggestions.

Joe

I've just had what looks like a similar problem on Fedora 17, solved 
by editing the 'files' section of xorg.conf.  Modules sourced by X.org 
were being loaded instead of those from nvidia. Here's an extract from 
the thread (304.64) on rpmfusion-users.



Section "Files"
#   ModulePath   "/usr/lib64/xorg/modules/extensions/nvidia"
    ModulePath   "/usr/lib64/nvidia/xorg"
#   ModulePath   "/usr/lib64/xorg/modules/extensions"
    ModulePath   "/usr/lib64/xorg/modules"
    FontPath     "/usr/share/fonts/default/Type1"
EndSection

as suggested here:

http://forums.fedoraforum.org/showthread.php?t=286084


John P

Thanks for the suggestion, John, but that doesn't seem to be my problem, at 
least not exactly.


Reading the thread and following the suggestions did not solve my 
problem.  I believe that thread refers to a packaging issue with 304.64, 
whereas the current ELREPO version is 310.44.  /etc/X11/xorg.conf was using 
different paths, and changing them to the ones suggested gave the same 
errors.


I don't seem to be able to go back to 304 because I also need the 32-bit 
drivers, and those seem to be available only for the current version.  I 
wish I didn't need to install 32-bit drivers, but the program that uses 
them isn't mine.


My next straw to grasp is to try the drivers from nvidia even though I 
hate to do it.  I had an issue a couple of years ago doing that on 
Ubuntu.  It seems their packagers put files in a different location than 
nvidia does, so mixing the deb package and the .run download ended up with 
mismatched dynamic libraries, which was a whole lotta fun tracking down.  
I mention that to remind myself what a bad idea it is to mix the two.


I'll report back if that works.

Joe


Re: Help finding a hardware problem (I think)

2013-04-25 Thread Jeff Siddall

On 04/25/2013 08:31 AM, Elias Persson wrote:

On 2013-04-24 19:34, Joseph Areeda wrote:

Thanks Jeff,

This does support my current hypothesis that the SSD I was mounting on /
is the most likely culprit.

What fun.

Joe

On 04/24/2013 10:27 AM, Jeff Siddall wrote:

On 04/23/2013 07:20 PM, Konstantin Olchanski wrote:

disk utility show ... SMART [is] fine.

SMART "health report" is useless. I had dead disks report "SMART OK"
and perfectly functional disks report "SMART Failure, replace your
disk now".


Agreed.  SMART doesn't diagnose everything.

On the flaky drive I recently replaced, the SMART extended offline tests
all passed, as did the SMART health assessment check. Nothing else was
wrong either (no pending/offline uncorrectable or CRC errors).  But it
surely was not working well.

Jeff



badblocks might be useful?

http://en.wikipedia.org/wiki/Badblocks

You'd presumably want the "non-destructive" tests...


smartctl -t long is probably a better option.  If a small number of bad 
blocks are detected, they should be swapped out by the drive itself, 
which makes them transparent to the FS.  You won't see any of that with 
badblocks.
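
A rough sketch of that sequence follows. It is an illustration only: /dev/sdX is a placeholder, smart_raw_value is a hypothetical helper, and the attribute name Reallocated_Sector_Ct is an assumption, since some drives (SSDs in particular) report remapped blocks under a different name.

```shell
#!/bin/sh
# smart_raw_value ATTRIBUTE
# Read `smartctl -A` output on stdin and print the RAW_VALUE column
# (the 10th field) for the named attribute. Prints nothing if absent.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Typical sequence (as root; /dev/sdX is a placeholder):
#   smartctl -t long /dev/sdX          # start the extended self-test
#   smartctl -l selftest /dev/sdX      # read the result once it has finished
#   smartctl -A /dev/sdX | smart_raw_value Reallocated_Sector_Ct
```

A non-zero and growing raw value suggests the drive is remapping blocks. As noted earlier in the thread, a clean result does not prove the drive is healthy.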


Jeff


Re: Help finding a hardware problem (I think)

2013-04-25 Thread Elias Persson

On 2013-04-24 19:34, Joseph Areeda wrote:

Thanks Jeff,

This does support my current hypothesis that the SSD I was mounting on /
is the most likely culprit.

What fun.

Joe

On 04/24/2013 10:27 AM, Jeff Siddall wrote:

On 04/23/2013 07:20 PM, Konstantin Olchanski wrote:

disk utility show ... SMART [is] fine.

SMART "health report" is useless. I had dead disks report "SMART OK"
and perfectly functional disks report "SMART Failure, replace your
disk now".


Agreed.  SMART doesn't diagnose everything.

On the flaky drive I recently replaced, the SMART extended offline tests
all passed, as did the SMART health assessment check. Nothing else was
wrong either (no pending/offline uncorrectable or CRC errors).  But it
surely was not working well.

Jeff



badblocks might be useful?

http://en.wikipedia.org/wiki/Badblocks

You'd presumably want the "non-destructive" tests...


RE: Where is version 3.2.9 of the environment-modules RPM

2013-04-25 Thread Ree, Jan-Albert van
My bad :(
I've been using an incorrect URL in my Spacewalk setup:

$mirror/scientificlinux/6.2/x86_64/updates/security/

Which path should I mirror (my baseline in Spacewalk is SL 6.2) to make sure 
I get all the latest updates? What I do need to avoid is pulling in any 
sl-release packages, as upgrading those breaks Spacewalk...
--
Jan-Albert van Ree


Jan-Albert van Ree
Linux System Administrator
MSuG MARIN Support Group
E mailto:j.a.v@marin.nl
T +31 317 49 35 48

MARIN
2, Haagsteeg, P.O. Box 28, 6700 AA Wageningen, The Netherlands
T +31 317 49 39 11, F +31 317 49 32 45, I www.marin.nl


From: Dr Andrew C Aitchison [a.c.aitchi...@dpmms.cam.ac.uk]
Sent: Thursday, April 25, 2013 11:02
To: Ree, Jan-Albert van
Cc: SCIENTIFIC-LINUX-USERS@FNAL.GOV
Subject: Re: Where is version 3.2.9 of the environment-modules RPM

On Thu, 25 Apr 2013, Ree, Jan-Albert van wrote:

> A while ago we discovered a bug in environment-modules-3.2.7 and reported 
> this upstream.
> RedHat responded with an update, environment-modules-3.2.9c-4.el6.src.rpm
>
> However this RPM hasn't reached the SL6 repositories yet. Can anybody 
> indicate when this fix will become available?

Hm. They are in SL6.4 and 6rolling, eg
http://ftp.scientificlinux.org/linux/scientific/6.4/SRPMS/vendor/environment-modules-3.2.9c-4.el6.src.rpm
http://ftp.scientificlinux.org/linux/scientific/6rolling/SRPMS/vendor/environment-modules-3.2.9c-4.el6.src.rpm

--
Dr. Andrew C. Aitchison Computer Officer, DPMMS, Cambridge
a.c.aitchi...@dpmms.cam.ac.uk   http://www.dpmms.cam.ac.uk/~werdna


Re: Where is version 3.2.9 of the environment-modules RPM

2013-04-25 Thread Dr Andrew C Aitchison

On Thu, 25 Apr 2013, Ree, Jan-Albert van wrote:


A while ago we discovered a bug in environment-modules-3.2.7 and reported this 
upstream.
RedHat responded with an update, environment-modules-3.2.9c-4.el6.src.rpm

However this RPM hasn't reached the SL6 repositories yet. Can anybody indicate 
when this fix will become available?


Hm. They are in SL6.4 and 6rolling, eg
http://ftp.scientificlinux.org/linux/scientific/6.4/SRPMS/vendor/environment-modules-3.2.9c-4.el6.src.rpm
http://ftp.scientificlinux.org/linux/scientific/6rolling/SRPMS/vendor/environment-modules-3.2.9c-4.el6.src.rpm

--
Dr. Andrew C. Aitchison Computer Officer, DPMMS, Cambridge
a.c.aitchi...@dpmms.cam.ac.uk   http://www.dpmms.cam.ac.uk/~werdna


Re: Problems with kmod-nvidia and 6.4

2013-04-25 Thread John Pilkington

On 25/04/13 03:23, Joseph Areeda wrote:

Here I am again.

I have my system working well with nouveau drivers but I need to install
the nvidia drivers because I do some GPU work.

I tried installing kmod-nvidia and nvidia-x11-drv from ELREPO but X
won't start.

BTW - if this happens to you it is very helpful to set the boot level to
3 so you can poke around.  Edit /etc/inittab and change the level in the
last line from 5 to 3.

I reread the thread from a few weeks ago on this problem but didn't see
any resolution.

The error messages that seem pertinent are:

 From /var/log/Xorg.0.log.old (I'm now running nouveau drivers)

[45.437] (II) LoadModule: "glx"
[45.449] (II) Loading /usr/lib64/xorg/modules/extensions/nvidia/libglx.so
[46.197] (II) Module glx: vendor="NVIDIA Corporation"
[46.197] compiled for 4.0.2, module version = 1.0.0
[46.197] Module class: X.Org Server Extension
[46.197] (II) NVIDIA GLX Module  310.44  Wed Mar 27 15:10:55 PDT 2013
[46.197] Loading extension GLX
[46.197] (II) LoadModule: "nvidia"
[46.206] (II) Loading /usr/lib64/xorg/modules/drivers/nvidia_drv.so
[46.269] (II) Module nvidia: vendor="NVIDIA Corporation"
[46.269] compiled for 4.0.2, module version = 1.0.0
[46.269] Module class: X.Org Video Driver
[46.581] (EE) NVIDIA: Failed to load the NVIDIA kernel module. Please check your
[46.581] (EE) NVIDIA: system's kernel log for additional error messages.
[46.581] (II) UnloadModule: "nvidia"
[46.581] (II) Unloading nvidia
[46.581] (EE) Failed to load module "nvidia" (module-specific error, 0)
[46.581] (EE) No drivers available.

from /var/log/messages:

Apr 24 18:53:21 george kernel: nvidia: module license 'NVIDIA' taints kernel.
Apr 24 18:53:21 george kernel: Disabling lock debugging due to kernel taint
Apr 24 18:53:21 george kernel: NVRM: The NVIDIA probe routine was not called for 1 device(s).
Apr 24 18:53:21 george kernel: NVRM: This can occur when a driver such as nouveau, rivafb,
Apr 24 18:53:21 george kernel: NVRM: nvidiafb, or rivatv was loaded and obtained ownership of
Apr 24 18:53:21 george kernel: NVRM: the NVIDIA device(s).
Apr 24 18:53:21 george kernel: NVRM: Try unloading the conflicting kernel module (and/or
Apr 24 18:53:21 george kernel: NVRM: reconfigure your kernel without the conflicting
Apr 24 18:53:21 george kernel: NVRM: driver(s)), then try loading the NVIDIA kernel module
Apr 24 18:53:21 george kernel: NVRM: again.
Apr 24 18:53:21 george kernel: NVRM: No NVIDIA graphics adapter probed!

Checking /etc/modprobe.d (with the nvidia drivers installed) showed
those modules were either in blacklist.conf or blacklist-nouveau.conf.

I also tried

dracut -f /boot/initramfs-$(uname -r).img $(uname -r)

which has worked in the past when blacklist wasn't really blacklisting.

I'm not sure I will be running 6.4 from now on.  It has not been tested
with the rpms from our group.

I'd appreciate any suggestions.

Joe

I've just had what looks like a similar problem on Fedora 17, solved by 
editing the 'files' section of xorg.conf.  Modules sourced by X.org were 
being loaded instead of those from nvidia.  Here's an extract from the 
thread (304.64) on rpmfusion-users.



Section "Files"
#   ModulePath   "/usr/lib64/xorg/modules/extensions/nvidia"
    ModulePath   "/usr/lib64/nvidia/xorg"
#   ModulePath   "/usr/lib64/xorg/modules/extensions"
    ModulePath   "/usr/lib64/xorg/modules"
    FontPath     "/usr/share/fonts/default/Type1"
EndSection

as suggested here:

http://forums.fedoraforum.org/showthread.php?t=286084
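
To see whether the ModulePath change is actually picked up, it may help to list the module paths the X server reports in its log. This is only a sketch: loaded_module_paths is a hypothetical helper, and it assumes each "(II) Loading" entry sits on a single line in the real log file (the excerpts quoted in mail are line-wrapped).

```shell
#!/bin/sh
# loaded_module_paths [LOGFILE]
# Print the path of every module the X server reported loading.
# LOGFILE defaults to /var/log/Xorg.0.log.
loaded_module_paths() {
    log=${1:-/var/log/Xorg.0.log}
    grep -E '\(II\) Loading /' "$log" | awk '{ print $NF }'
}

# Example: check which glx module was chosen.
#   loaded_module_paths | grep glx
```

If the glx path printed is the X.org copy rather than the nvidia one, the ModulePath order is the likely culprit.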


John P


Where is version 3.2.9 of the environment-modules RPM

2013-04-25 Thread Ree, Jan-Albert van
A while ago we discovered a bug in environment-modules-3.2.7 and reported this 
upstream.
RedHat responded with an update, environment-modules-3.2.9c-4.el6.src.rpm

However this RPM hasn't reached the SL6 repositories yet. Can anybody indicate 
when this fix will become available?

(the issue was that if any module you loaded used MANPATH, your MANPATH would 
be corrupted for the remainder of the session, even if you unloaded all modules)

Thanks,
--
Jan-Albert van Ree


Jan-Albert van Ree
Linux System Administrator
MSuG MARIN Support Group
E mailto:j.a.v@marin.nl
T +31 317 49 35 48

MARIN
2, Haagsteeg, P.O. Box 28, 6700 AA Wageningen, The Netherlands
T +31 317 49 39 11, F +31 317 49 32 45, I www.marin.nl