> On Jul 12, 2015, at 5:26 PM, Derek Yarnell <de...@umiacs.umd.edu> wrote:
>
> On 7/12/15 3:21 PM, Günther Alka wrote:
>> First action:
>> If you can mount the pool read-only, update your backup
>
> We are securing all the non-scratch data before messing with the pool
> any further. We have backups from the night before, but it will still
> be faster to pull the current data from the read-only pool than from
> backups.
>
>> Then
>> I would expect that a single bad disk is the reason for the problem
>> on a write command. I would first check the system and fault logs or
>> SMART values for hints about a bad disk. If there is a suspicious
>> disk, remove it and retry a regular import.
>
> We pulled all the disks individually yesterday to test this exact
> theory. We have hit the mpt_sas disk-failure panics before, so we had
> already tried this.
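(For reference, the read-only import and fault/SMART checks discussed
above look roughly like this on an illumos/OmniOS system; the pool name
"tank" and device "c1t0d0" below are placeholders, not details from
this thread:)

    # Import the pool read-only so nothing is written while data is copied off.
    zpool import -o readonly=on -f tank

    # Look for recent disk or I/O fault events in the fault manager.
    fmadm faulty
    fmdump -eV | tail -100

    # If smartmontools is installed, check the health of a suspect disk.
    smartctl -a /dev/rdsk/c1t0d0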
I don't believe this is a bad disk. Additional block pointer
verification code was added in changeset
f63ab3d5a84a12b474655fc7e700db3efba6c4c9 and is likely the cause of
this assertion. In general, assertion failures are almost always
software problems -- the programmer didn't see what they expected.

Dan, if you're listening, Matt would be the best person to weigh in on
this.

 -- richard

>
>> If there is no hint:
>> Next, I would try a pool export. Then create a script that imports
>> the pool followed by a scrub cancel (hope that the cancel is faster
>> than the crash). Then check the logs during some pool activity.
>
> If I have not imported the pool read/write, can I export the pool? I
> thought we had tried this, but I will have to confer.
>
>> If this does not help, I would remove all data disks and boot up.
>> Then hot-plug disk by disk, check that each is detected properly, and
>> check the logs. Your pool remains offline until enough disks come
>> back. Adding disk by disk and checking the logs should help to find
>> the bad disk that initiates a crash.
>
> This is interesting and we will try this once we secure the data.
>
>> Next option: try a pool import with one disk missing, rotating which
>> disk is left out. As long as there are no writes, missing disks are
>> not a problem for ZFS (you may need to clear errors).
>
> Wouldn't this be the same as hot-plugging disk by disk, as above?
>
>> Last option:
>> Use another server to try the import (in case of a mainboard, power,
>> HBA, or backplane problem), or remove all disks and run a
>> nondestructive or SMART test on another machine.
>
> Sadly we do not have a spare chassis with 40 slots around to test
> this. I am so far unconvinced that this is a hardware problem, though.
>
> We will most likely boot into a Linux live CD to run smartctl and see
> if it has any information on the disks.
>
> --
> Derek T. Yarnell
> University of Maryland
> Institute for Advanced Computer Studies
> _______________________________________________
> OmniOS-discuss mailing list
> OmniOS-discuss@lists.omniti.com
> http://lists.omniti.com/mailman/listinfo/omnios-discuss

_______________________________________________
OmniOS-discuss mailing list
OmniOS-discuss@lists.omniti.com
http://lists.omniti.com/mailman/listinfo/omnios-discuss
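A minimal sketch of the import-then-cancel-scrub script described in
the thread, assuming a pool named "tank" (the real pool name never
appears in this excerpt); whether the cancel can win the race against
the panic is exactly the open question:

    #!/bin/sh
    # Import the pool and immediately try to stop the resuming scrub,
    # in the hope that the cancel lands before the assertion fires.
    POOL=tank                        # placeholder pool name

    zpool import -f "$POOL" || exit 1
    zpool scrub -s "$POOL"           # -s stops a scrub in progress

    # Then watch pool state and the system log for new errors.
    zpool status -v "$POOL"
    tail -50 /var/adm/messages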