Yeah, sorry, I have no answers for you, but you do have my sympathy.
I've had to do that kind of detective work before. Sometimes it is an oddly named file, a very long-named file, or sometimes a file that somehow got a bizarre date, like "Apr 15 1904". In a few cases it has also been hung NFS mounts somewhere in the path. I've had to drill down through each of the subdirectories one after another, just like you did, to figure it out, because there was no filename or other hint in the schedule or error logs, just a generic failure message. Luckily I only have to do it about once or twice a year, but it is time consuming.

Ben

-----Original Message-----
From: ADSM: Dist Stor Manager [mailto:[EMAIL PROTECTED] On Behalf Of Zoltan Forray/AC/VCU
Sent: Friday, April 01, 2005 9:03 AM
To: ADSM-L@VM.MARIST.EDU
Subject: Re: Large Linux clients

Thanks for the suggestion. However, this is not true. We already tried this. We did "find . | wc -l" to get the object count (1.1M) with no problems. But the backup still will not work. It constantly fails, in unpredictable/inconsistent places, with the same "Producer Thread" error.

I spent 2+ days drilling through the various subdirectories (of the directory that causes the failures), one by one, and was able to back up 38 of the 40 subdirs, totalling over 980K objects, without a problem. When I included the two other directories in the same pile, the backup would fail. When I then went back and individually selected the sub-subdirectories of these subdirectories (one at a time), I was able to back up *ALL* of the sub-subdirectories, no problem. Then I went back and selected the upper-level directory and backed it up, no problem.

Let me draw a picture of the structure of these directories. The problem directories are in this directory: /coyote/dsk3/patients/prostateReOpt/Mount_0/ . If I try to back up /Mount_0/ as a whole, it crashes every time. If I point to subdirs below /Mount_0/ (40 of these, all with the same four named sub-subdirs), two of them cause a crash.
I noted that these two both have >72K objects while the other 38 have fewer than 60K objects. Yet when I manually picked the four sub-subdirs of the Patient_172 dir, the backup worked (sort of; see below). Same for Patient_173.

To really drive me crazy, on the first attempt at backing up one of the sub-subdirs under Patient_172, the backup crashed, yet I could back up the other three with no issue. So we started looking at the problem subdir and noticed a weird file name that ended in a tilde (~). When I excluded it, the backup ran. Then when I went back and picked just the file with the tilde, it backed up fine (my head is getting balder-and-balder!!). I then went back and re-selected the whole Patient_172 directory and it backed up (or at least scanned it, since everything was already backed up) just fine!!! ARRRRRRRRRRRRGGGGGGHHHHHHHHHHHHH!! This is maddening and shows no rhyme or reason.

Henk ten Have <[EMAIL PROTECTED]>
Sent by: "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU>
04/01/2005 08:29 AM
Please respond to "ADSM: Dist Stor Manager" <ADSM-L@VM.MARIST.EDU>
To: ADSM-L@VM.MARIST.EDU
Subject: Re: [ADSM-L] Large Linux clients

An old trick I used for many years: to investigate a "problem" filesystem, do a "find" in that filesystem. If the find dies, TSM definitely will die. I'll bet your find will die, and that's why your backup will die/hang or whatever also. A find will do a file stat on all files/dirs, which is actually the same thing the backup does. So your issue is OS related, not TSM.

Cheers,
Henk

On Tuesday 29 March 2005 12:11, you wrote:
> On Mar 29, 2005, at 12:37 PM, Zoltan Forray/AC/VCU wrote:
> > ...However, when I try to back up the tree at the third level (e.g.
> > /coyote/dsk3/), the client pretty much seizes immediately and
> > dsmerror.log says "B/A Txn Producer Thread, fatal error, Signal 11".
> > The server shows the session as "SendW" and nothing else going
> > on....
>
> Zoltan -
>
> Signal 11 is a segfault - a software failure.
> The client programming has a defect, which may be incited by a problem
> in that area of the file system (so have that investigated). A
> segfault can be induced by memory constraint, which in this context
> would most likely be Unix resource limits, so also enter the command
> 'limit' in Linux csh or tcsh and potentially boost the stack size
> ('unlimit stacksize'). This is to say that the client was probably
> invoked in an artificially limited environment.
>
> Richard Sims
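[Editor's note] The detective work described above (Ben's odd names and bizarre dates, Henk's stat-everything trick, the tilde file) can be rolled into a quick diagnostic sweep. This is only a sketch, not anything posted in the thread: it assumes a POSIX shell plus GNU findutils/coreutils, and it builds a tiny throwaway demo tree (the Patient_X names are made up to echo the thread) rather than touching a real filesystem.

```shell
#!/bin/sh
# Build a small demo tree to sweep (made-up names; point $demo at a real
# filesystem such as /coyote/dsk3/patients/... to use this for real).
demo=$(mktemp -d)
mkdir -p "$demo/Patient_X"
touch "$demo/Patient_X/image.dat" "$demo/Patient_X/notes.txt~"

# Henk's trick: stat every file and directory, which is essentially what
# the backup client's scan phase does. If this hangs or errors (e.g. on a
# hung NFS mount), the backup will too, and the problem is the OS, not TSM.
find "$demo" -exec stat {} + > /dev/null || echo "stat sweep failed"

# Ben's oddities: very long names (last path component over 200 chars)...
long_names=$(find "$demo" | awk -F/ 'length($NF) > 200')

# ...and bizarre timestamps, e.g. older than ~50 years ("Apr 15 1904").
ancient=$(find "$demo" -mtime +18250)

# The thread's culprit: editor-backup style names ending in a tilde.
tilde_files=$(find "$demo" -name '*~')

printf 'long: %s\nancient: %s\ntilde: %s\n' \
    "$long_names" "$ancient" "$tilde_files"

rm -rf "$demo"
```

On the demo tree, only the tilde check turns anything up; on a real problem filesystem, any of the three lists (or a hang in the stat sweep itself) narrows down where to look next.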
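[Editor's note] Richard's resource-limit suggestion is given in csh/tcsh syntax ('limit' / 'unlimit stacksize'). A sketch of the sh/bash equivalent, for anyone launching the client from a Bourne-style shell; raising the limit may fail for non-root users if the hard limit is lower, which the fallback message covers.

```shell
#!/bin/sh
# Check the current stack soft limit (printed in KB, or "unlimited").
stack_kb=$(ulimit -s)
echo "current stack limit: $stack_kb"

# Try to lift it; non-root users may be capped by the hard limit.
ulimit -s unlimited 2>/dev/null || echo "could not raise stack limit"
echo "stack limit now: $(ulimit -s)"

# Start the client from this same shell so it inherits the new limit,
# e.g. (path taken from the thread):
#   dsmc incremental /coyote/dsk3/
```

The point of Richard's advice is that the dsmc process inherits whatever limits the invoking shell had, so a segfault that only appears on deep or wide directory trees can simply be the scan recursion hitting an artificially small stack.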