Re: increase softint_bytes
On Tue, Nov 21, 2017 at 03:45:26PM +0900, Masanobu SAITOH wrote:
> 0) Apply the following change to -current.
>
> Index: kern_softint.c
> ===
> RCS file: /cvsroot/src/sys/kern/kern_softint.c,v
> retrieving revision 1.43
> diff -u -p -r1.43 kern_softint.c
> --- kern_softint.c   4 Jul 2016 04:20:14 -0000   1.43
> +++ kern_softint.c   21 Nov 2017 06:41:35 -0000
> @@ -217,7 +217,7 @@ typedef struct softcpu {
>
>  static void softint_thread(void *);
>
> -u_int softint_bytes = 8192;
> +u_int softint_bytes = 16384;
>  u_int softint_timing;
>  static u_int softint_max;
>  static kmutex_t softint_lock;
>
> 1) Send the pullup request to netbsd-8.
>
> 2) Write auto-resize code and commit.
>
> 3) If it's stable, send the pullup request to netbsd-8.
>
> OK?

Sounds like a great plan!

Martin
Re: increase softint_bytes
On 2017/11/20 17:28, Masanobu SAITOH wrote:
> On 2017/11/17 18:42, 6b...@6bone.informatik.uni-leipzig.de wrote:
>> On Thu, 16 Nov 2017, Masanobu SAITOH wrote:
>>
>>> Hi, all.
>>>
>>> Some device drivers now allocate a lot of softints.
>>> See:
>>>
>>> http://mail-index.netbsd.org/current-users/2017/11/09/msg032581.html
>>>
>>> To avoid this panic, I wrote the following patch:
>>>
>>> http://www.netbsd.org/~msaitoh/softint-20171116-0.dif
>>
>> I tested the patch. Now the dump comes in another place.
>>
>> https://suse.uni-leipzig.de/crash/crash-with-patch.jpg
>>
>> Regards
>> Uwe
>
> Could you test the following patch?
>
> http://www.netbsd.org/~msaitoh/vlan-20171120-0.dif

Updated patch:

    http://www.netbsd.org/~msaitoh/vlan-20171121-0.dif

    Fix compile error (sorry)
    Revert if_wmreg.h 1.104 and if_wm.c 1.542

--
---
SAITOH Masanobu (msai...@execsw.org
                 msai...@netbsd.org)
Re: increase softint_bytes
>> can't this be fixed by making it dynamic?
>
> It's not easy because the return value of softint_establish() is made
> from this area's address. As you know, the value is kept by each driver.

I'm sorry. I misread kern_softint.c. The return value does not point
directly into the area; it is an offset into the area, so it should be
easy to resize it.

I'll try to modify the code to do auto-resize. It'll take a little time
to write this change. And since it's low-level and important code, it
will take time to test its stability before sending a pullup request.

So,

0) Apply the following change to -current.

Index: kern_softint.c
===
RCS file: /cvsroot/src/sys/kern/kern_softint.c,v
retrieving revision 1.43
diff -u -p -r1.43 kern_softint.c
--- kern_softint.c   4 Jul 2016 04:20:14 -0000   1.43
+++ kern_softint.c   21 Nov 2017 06:41:35 -0000
@@ -217,7 +217,7 @@ typedef struct softcpu {

 static void softint_thread(void *);

-u_int softint_bytes = 8192;
+u_int softint_bytes = 16384;
 u_int softint_timing;
 static u_int softint_max;
 static kmutex_t softint_lock;

1) Send the pullup request to netbsd-8.

2) Write auto-resize code and commit.

3) If it's stable, send the pullup request to netbsd-8.

OK?

--
---
SAITOH Masanobu (msai...@execsw.org
                 msai...@netbsd.org)
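The point about offsets versus addresses can be illustrated with a small userland sketch. This is not the kern_softint.c code: `struct slot`, `struct area`, `handle_t` and the `area_*` functions are hypothetical names made up for the illustration, and `realloc()` stands in for whatever allocator the kernel would use. It only shows why a handle that is an offset survives the area being moved during a resize, while a raw pointer would not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-handler record; the real softint structures differ. */
struct slot {
	void (*func)(void *);
	void *arg;
};

/* Hypothetical growable area that holds the slots. */
struct area {
	unsigned char *base;	/* start of the area; may move on resize */
	size_t bytes;		/* current size of the area */
	size_t used;		/* bytes handed out so far */
};

/* A handle is a byte offset from base, never a pointer into the area. */
typedef size_t handle_t;

/* Double the area. Offsets handed out earlier remain valid. */
static int
area_grow(struct area *a)
{
	size_t nbytes = a->bytes ? a->bytes * 2 : sizeof(struct slot);
	unsigned char *p = realloc(a->base, nbytes);

	if (p == NULL)
		return -1;
	memset(p + a->bytes, 0, nbytes - a->bytes);
	a->base = p;
	a->bytes = nbytes;
	return 0;
}

/* Allocate a slot, growing the area when it is full. */
static int
area_establish(struct area *a, void (*func)(void *), void *arg, handle_t *out)
{
	struct slot *s;

	while (a->used + sizeof(*s) > a->bytes)
		if (area_grow(a) != 0)
			return -1;
	s = (struct slot *)(a->base + a->used);
	s->func = func;
	s->arg = arg;
	*out = a->used;
	a->used += sizeof(*s);
	return 0;
}

/* Turn a handle back into a pointer, using the current base. */
static struct slot *
area_lookup(const struct area *a, handle_t h)
{
	assert(h + sizeof(struct slot) <= a->used);
	return (struct slot *)(a->base + h);
}
```

A caller keeps only the `handle_t` and calls `area_lookup()` each time it needs the slot, so nothing it holds goes stale when `area_grow()` moves the area. A real kernel version would also have to deal with locking and per-CPU copies, which this sketch ignores.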
Re: FFS corruption
>>> After this migration, the filesystem started corrupting the same
>>> file /usr/pkg/etc/httpd/httpd.conf at the same time, which happens
>>> to be /etc/daily end of execution.
>> Which filesystem? The original or the copy?
> The copy. It would blow my mind if making a copy of a VM could break
> the original.

Strictly speaking, yes. But you not only copied the data off the
source VM - which, yes, should not break anything - you also wrote it
to the destination. I can imagine cases where that writing is what
caused the trouble.

Writing data to the destination might do this if, for example, the
destination is another VM ultimately backed by the same spindle and
there is something wrong with a driver that causes it to confuse
nominally-distinct disk blocks with one another (like my possibility
2).

>> [With Xen] that there's at least one more layer of mapping between
>> OS sector numbers and hardware sector numbers
> Indeed; the domU filesystem is a file in the dom0 filesystem.

Hm. That does make it more difficult to come up with a plausible
failure mode to explain this.

Have you tried creating another file in the dom0 filesystem of the
same size with easily identifiable content, to see if any of that
content appears in the affected domU filesystem?

dholland's identification of the overwrite data as inodes certainly
does feel provocative, but I'm not sure what to make of it.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
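One way to produce the "easily identifiable content" suggested above is to stamp every sector of the test file with a text marker and its own sector number, so that any stray copy found inside the domU filesystem says where it came from. The following is only a sketch under assumptions: 512-byte sectors, and the names `stamp_sector()` and `write_canary()` plus the tag string are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECTOR_BYTES 512	/* assumed sector size */

/*
 * Fill one sector with a text stamp repeated to the end of the sector.
 * The tag must be short enough for the stamp to fit in the local buffer.
 */
static void
stamp_sector(unsigned char *sec, const char *tag, uint64_t index)
{
	char stamp[64];
	int n = snprintf(stamp, sizeof(stamp), "[%s sector %llu]\n",
	    tag, (unsigned long long)index);

	assert(n > 0 && (size_t)n < sizeof(stamp));
	for (size_t off = 0; off < SECTOR_BYTES; off += (size_t)n) {
		size_t left = SECTOR_BYTES - off;
		size_t len = left < (size_t)n ? left : (size_t)n;

		memcpy(sec + off, stamp, len);
	}
}

/* Write nsectors stamped sectors to an already-open stream. */
static int
write_canary(FILE *fp, const char *tag, uint64_t nsectors)
{
	unsigned char sec[SECTOR_BYTES];

	for (uint64_t i = 0; i < nsectors; i++) {
		stamp_sector(sec, tag, i);
		if (fwrite(sec, 1, sizeof(sec), fp) != sizeof(sec))
			return -1;
	}
	return 0;
}
```

After writing such a file in dom0 with the same size as the domU image, searching the affected domU files for the tag would show whether dom0 file blocks are leaking into the domU image, and the sector number in the stamp would show which ones.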
Re: FFS corruption
Mouse wrote:

> > After this migration, the filesystem started corrupting the same file
> > /usr/pkg/etc/httpd/httpd.conf at the same time, which happens to be
> > /etc/daily end of execution.
>
> Which filesystem? The original or the copy?

The copy. It would blow my mind if making a copy of a VM could break
the original.

> [With Xen] that there's at least one more layer of mapping between
> OS sector numbers and hardware sector numbers

Indeed; the domU filesystem is a file in the dom0 filesystem.

--
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: FFS corruption
> Everything was fine until I scp it over the network on a new machine.
> After this migration, the filesystem started corrupting the same file
> /usr/pkg/etc/httpd/httpd.conf at the same time, which happens to be
> /etc/daily end of execution.

Which filesystem? The original or the copy?

> httpd.conf metadata did not change, its content was just [filled]
> with some fixed length binary records (sample included below, in case
> it rings a bell to someone). Setting immutable flags did not prevent
> the corruption; And using ktrace on /etc/daily showed it did not
> touch httpd.conf nor even its parent directory.

> And fsck did not [find] anything wrong. Is there anything ringing a
> bell to someone here? Any explanation?

Offhand, this sounds like one of two things:

(1) The same piece of disk is being used by two filesystems at once,
    and that just happens to be the place where both filesystems
    actually _use_ overlapping pieces of disk (if a filesystem is
    mostly empty, most of the space it's nominally using can be
    scribbled on without corrupting the filesystem; two mostly empty
    filesystems nominally using overlapping areas of disk might end up
    almost never both actually depending on the same sectors).

(2) Somewhere in the data path for disk writes, the high bits of the
    disk block numbers are getting lost, thereby directing writes to
    two nominally different pieces of disk to the same sectors. This
    could be a software bug or a hardware issue (which could be a
    hardware bug, a software bug, or a case of incompatibility). As a
    simple example that probably is not what's going on in your case,
    a SCSI driver that doesn't know how to use 10-byte CDBs can end up
    redirecting sectors above the 1G point back onto the same sectors
    as others that are below the 1G point.

You mentioned that at least one of these machines was a Xen instance.
I don't know enough about Xen to do more than guess here, but it does
mean that there's at least one more layer of mapping between OS sector
numbers and hardware sector numbers, and thus at least one more layer
where two supposedly different pieces of disk could get mapped to the
same real sectors. Those additional layers are also additional places
where the sort of botch outlined in (2) could strike.

I realize this isn't very helpful, but it's about all that comes to
mind that explains your observations. In particular, the metadata not
changing, the immutable flag making no difference, ktrace showing no
accesses - those all, to me, point to something corrupting the disk
behind the OS's back. It could be either of the above, or perhaps even
broken disk firmware, though that strikes me as unlikely compared to
the above.

fsck noticing nothing wrong probably just means that the only thing
that got hit was data blocks. Hit a metadata block (inode table,
superblock, etc) instead and fsck should get upset, but if all you're
damaging is data blocks, fsck shouldn't care.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
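The CDB example in possibility (2) above can be worked through in a few lines. A 6-byte SCSI READ/WRITE CDB carries a 21-bit logical block address, and 2^21 sectors of 512 bytes is exactly 1 GiB, which is where the "1G point" comes from. The sketch below models a driver that silently drops the high bits; the function names are made up for the illustration, and 512-byte sectors are an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_BYTES	512u	/* assumed sector size */
#define LBA6_BITS	21u	/* LBA width in a 6-byte READ/WRITE CDB */

/* The block number that reaches the disk if the high bits are dropped. */
static uint32_t
lba_truncate6(uint64_t lba)
{
	return (uint32_t)(lba & ((UINT64_C(1) << LBA6_BITS) - 1));
}

/* True if two different block numbers end up on the same real sector. */
static int
lba_aliases(uint64_t a, uint64_t b)
{
	return a != b && lba_truncate6(a) == lba_truncate6(b);
}

/* Byte offset on disk of a given block number. */
static uint64_t
byte_offset(uint64_t lba)
{
	return lba * SECTOR_BYTES;
}
```

With this model, block 100 and block 2^21 + 100 alias each other, so a write intended for just past the 1 GiB mark lands on sector 100 instead. The same arithmetic applies to any layer that narrows a block number, whatever the width of the field that is lost.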
Re: FFS corruption
On Mon, Nov 20, 2017 at 08:09:28AM +0000, Emmanuel Dreyfus wrote:
> 00000000  80 81 01 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> 00000010  5a 8d 0e 5a 60 8e 09 0f  5a 8d 0e 5a 60 8e 09 0f  |Z..Z`...Z..Z`...|
> 00000020  5a 8d 0e 5a 60 8e 09 0f  00 00 00 00 00 00 00 00  |Z..Z`...........|
> 00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *

That's an inode.

--
David A. Holland
dholl...@netbsd.org
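To see why the sample reads as an inode, the first bytes can be decoded by hand. The sketch below assumes the FFSv1 on-disk inode layout as I recall it from ufs/ufs/dinode.h (16-bit mode at offset 0, 16-bit link count at offset 2, 32-bit access time at offset 16 followed by its nanoseconds), in little-endian byte order as on i386. The offsets should be checked against the header before relying on them, and the struct and function names here are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* First 32 bytes of the corrupted httpd.conf sample from the report. */
static const unsigned char sample_record[32] = {
	0x80, 0x81, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x5a, 0x8d, 0x0e, 0x5a, 0x60, 0x8e, 0x09, 0x0f,
	0x5a, 0x8d, 0x0e, 0x5a, 0x60, 0x8e, 0x09, 0x0f,
};

/* Subset of inode fields, at the offsets assumed above. */
struct inode_guess {
	uint16_t mode;		/* file type and permissions */
	uint16_t nlink;		/* link count */
	uint32_t atime;		/* access time, seconds since the epoch */
	uint32_t atime_nsec;	/* access time, nanoseconds */
};

static uint16_t
le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t
le32(const unsigned char *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the leading fields of a candidate 128-byte inode record. */
static struct inode_guess
decode_ufs1_prefix(const unsigned char *rec)
{
	struct inode_guess g;

	g.mode = le16(rec + 0);
	g.nlink = le16(rec + 2);
	g.atime = le32(rec + 16);
	g.atime_nsec = le32(rec + 20);
	return g;
}

/* Plausibility check: regular file, linked, sane nanosecond field. */
static int
looks_like_regular_file(const struct inode_guess *g)
{
	return (g->mode & 0170000) == 0100000 && g->nlink > 0 &&
	    g->atime_nsec < 1000000000u;
}
```

Under those assumptions the sample decodes to mode 0100600 (a regular file with permissions 0600), link count 1, and an access time of 1510903130 seconds, which falls in mid-November 2017, a few days before the report. The records in the full sample also repeat every 0x80 bytes, which matches a 128-byte on-disk inode.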
Re: FFS corruption
On Mon, Nov 20, 2017 at 01:33:44PM +0000, Christos Zoulas wrote:
> I think if the block allocation fails in a bad spot on ffsv2, fsck does
> not correct it, so new file allocation from those blocks will fail.

But that happened on FFSv1, and on an existing file.

--
Emmanuel Dreyfus
m...@netbsd.org
Re: FFS corruption
In article <20171120103643.gh4...@trav.math.uni-bonn.de>,
Edgar Fuß wrote:
>> Is there anything ringing a bell to someone here?
>Yes, but I guess that doesn't help.
>I experienced something remotely similar after a disc firmware crash followed
>by a mpt(4) lockup (before I wrote the timeout recovery buhrow@ committed).
>I would get a "mangled directory" panic on the same directory again and again;
>fsck repaired it but found nothing else. I was just short of
>dump/newfs/restore, but then something (I guess removing that directory)
>helped. That was on a FFSv2, though.
>
>> Any explanation?
>No. Only that apparently, an FFS can be inconsistent in a way fsck doesn't
>recognize.

I think if the block allocation fails in a bad spot on ffsv2, fsck does
not correct it, so new file allocation from those blocks will fail.

christos
Re: FFS corruption
> Is there anything ringing a bell to someone here?
Yes, but I guess that doesn't help.
I experienced something remotely similar after a disc firmware crash
followed by a mpt(4) lockup (before I wrote the timeout recovery
buhrow@ committed). I would get a "mangled directory" panic on the same
directory again and again; fsck repaired it but found nothing else. I
was just short of dump/newfs/restore, but then something (I guess
removing that directory) helped. That was on a FFSv2, though.

> Any explanation?
No. Only that apparently, an FFS can be inconsistent in a way fsck
doesn't recognize.
Re: increase softint_bytes
On 2017/11/17 18:42, 6b...@6bone.informatik.uni-leipzig.de wrote:
> On Thu, 16 Nov 2017, Masanobu SAITOH wrote:
>
>> Hi, all.
>>
>> Some device drivers now allocate a lot of softints.
>> See:
>>
>> http://mail-index.netbsd.org/current-users/2017/11/09/msg032581.html
>>
>> To avoid this panic, I wrote the following patch:
>>
>> http://www.netbsd.org/~msaitoh/softint-20171116-0.dif
>
> I tested the patch. Now the dump comes in another place.
>
> https://suse.uni-leipzig.de/crash/crash-with-patch.jpg
>
> Regards
> Uwe

Could you test the following patch?

    http://www.netbsd.org/~msaitoh/vlan-20171120-0.dif

--
---
SAITOH Masanobu (msai...@execsw.org
                 msai...@netbsd.org)
FFS corruption
Hello

I experienced some nasty FFS corruption, which was only resolved by
reformatting.

The filesystem was a root partition image for a Xen NetBSD-7.1/i386
domU. It was formatted as FFSv1 level 4.

Everything was fine until I copied it with scp over the network to a
new machine. After this migration, the filesystem started corrupting
the same file, /usr/pkg/etc/httpd/httpd.conf, at the same time each
day, which happens to be when /etc/daily finishes running.

Other files were affected, including a
/usr/pkg/etc/httpd/httpd.conf.bak I had put there for recovery, but it
is difficult to assess the extent of the problem. I assume few files
were touched, because the machine could still work after just restoring
httpd.conf.

The httpd.conf metadata did not change; its content was just filled
with some fixed-length binary records (sample included below, in case
it rings a bell to someone). Setting immutable flags did not prevent
the corruption, and using ktrace on /etc/daily showed that it did not
touch httpd.conf or even its parent directory.

And fsck did not find anything wrong. Is there anything ringing a bell
to someone here? Any explanation?

Corrupted httpd.conf sample:
00000000  80 81 01 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000010  5a 8d 0e 5a 60 8e 09 0f  5a 8d 0e 5a 60 8e 09 0f  |Z..Z`...Z..Z`...|
00000020  5a 8d 0e 5a 60 8e 09 0f  00 00 00 00 00 00 00 00  |Z..Z`...........|
00000030  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00 00 00 00 00 00 00  00 00 00 00 f4 4e a9 72  |.............N.r|
00000070  90 01 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  80 81 01 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000090  7f 8d 0e 5a 7e 77 dc 25  7f 8d 0e 5a 7e 77 dc 25  |...Z~w.%...Z~w.%|
000000a0  7f 8d 0e 5a 7e 77 dc 25  00 00 00 00 00 00 00 00  |...Z~w.%........|
000000b0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
000000e0  00 00 00 00 00 00 00 00  00 00 00 00 8b 73 ed 77  |.............s.w|
000000f0  90 01 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000100  80 81 01 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000110  92 8d 0e 5a 54 43 af 15  92 8d 0e 5a 54 43 af 15  |...ZTC.....ZTC..|
00000120  92 8d 0e 5a 54 43 af 15  00 00 00 00 00 00 00 00  |...ZTC..........|
00000130  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*

--
Emmanuel Dreyfus
m...@netbsd.org